TEACHERS' MARKS



INTRODUCTION 

Two distinct questions are involved in the problem of assign- 
ing marks to pupils. The first concerns the average standard 
of achievement which should be expected of normal children of 
a given age and grade, and the second concerns the distribution 
of the ability within the normal group around that standard. 
The first is a question for the school administrator primarily, 
while the second must be solved by the psychologist. The inter- 
relation between the two questions is obvious and suggests the 
need for psychology in school administration and the need for 
keen insight into the problems of administration on the part of 
the educational psychologist. However, I shall not examine the 
second question beyond what its relation with the first demands. 

The answer to the first question lies in our discovery of some 
method of defining merit in school work. We have long de- 
pended upon the examination paper and still do depend upon it 
almost universally. Out of a growing recognition of the inade- 
quacy of the examination as at present administered, there has 
developed in this country a disposition to depend more and more 
upon the individual teacher's notion of what is proper to expect 
in the way of student achievement. This is resulting in wide 
differences in demands because the standards of teachers are 
far from uniform. The standardizing influence of examinations 
is being removed without anything being put in its place. For 
example, the custom has come to be quite universal throughout 
the middle and western states of having admission to college 
based, not upon examinations as was the custom not so many 
years ago, but upon high school accreditment, which is an ar- 
rangement between the high school and college whereby the 
graduates of the high school are admitted to college without 
examination, provided certain standards in equipment, instruc- 
tion, course of study, etc., are maintained by the high school. 
To be sure, there is usually a representative of the college desig- 
nated as high school inspector whose duty it is to keep the high 
schools up to a fairly uniform standard of scholarship, but anyone
familiar with the work of inspectors in general will not claim 
that marked success attends their efforts. Standardization of 
equipment, teachers' salaries and experience, number of recita- 
tions, etc., does little more than begin to standardize the require- 
ment for a passing achievement in a given subject of study. 
Work which satisfies a teacher of chemistry in one accredited 
school would be considered far from satisfactory by another 
teacher of chemistry in another accredited school. This absence 
of uniformity of requirement seems fairly spread over the whole 
field of education. With the increasing emphasis we are plac- 
ing in this country upon the teacher's individuality, the situation 
is likely to grow worse unless some measures for standardization 
can be put into operation. 

It is with this question of standards among teachers that the 
present study is concerned. The effort is here made to point 
out the extent of variability among teachers in rating work of 
equal merit. From this the need for practical definitions of 
standard achievements may be appreciated. When we appre- 
ciate the need for defining achievement we are ready to consider 
some of the tests and scales which have been devised for the 
purpose of making possible these definitions. 

The Problem 

The problem undertaken in this study, then, is twofold: first, 
to set forth the situation as it exists with respect to teachers' 
marks, and, second, to examine certain standard tests and scales 
to determine their effectiveness in improving this situation. 

The Material and Method 

In the first part of the study which undertakes to set forth the 
existing conditions with respect to the variability of standards 
among teachers, my main task has been to summarize and eval- 
uate the work of former students of the subject. I have found 
it necessary in many cases to give the results of earlier studies 
in the form of tables which the authors have used to summarize 
their findings. 

In the second part of the study I have undertaken to try out 
certain recently devised tests and scales as instruments for
reducing the variability among standards of marking which the
present situation reveals. Comparison between the variabili- 
ties, found without the scale and with it, forms the basis of the 
study. There is a definite limitation to be kept in mind through- 
out this comparison. In every case the results recorded are ob- 
tained from persons thoroughly experienced in the use of the 
common systems of marking and completely unpracticed in the 
use of the derived scales. Until some future study reveals the 
effects of practice in the use of the derived scales, no final 
judgment can be made as to the ultimate service of the scales 
in establishing standards. The present study contains a few 
evidences tending to show that the practice effect is rapid and 
great. 



STANDARDS OF MARKING IN ELEMENTARY SCHOOLS 

The two most significant studies in this field support the gen- 
erally accepted notion that marks mean very different things to 
different teachers. The investigation by Ralph E. Carter 1 in 
Milwaukee, Wis., in 1911, where a uniform system of marking 
prevails throughout the elementary schools, revealed some strik- 
ing facts. He considered only classes which completed the 
eighth grade in 1907, thereby assuring uniform instructions about 
grading and uniform curricula among the several schools. The 
following variation was found in the marks given by three schools: 
Of all the marks given in arithmetic, 

In School A, 1/3 were below 79, and 1/3 above 84.

In School B, 1/3 were below 71, and 1/3 above 78.

In School C, 1/3 were below 82, and 1/3 above 88.

Two thirds of School B fall within the range of the lowest 
third of School A, while two thirds of School C fall within the 
range of the highest third of School B with a margin of four 
points to spare. 

As an indication of how much real difference in ability these
differences in marks indicated, the records of the members from
these schools were traced in the high schools. It was necessary to
determine first what proportion of the members from the poorer 
and better sections of the three classes entered high school. It 
was found that the school grading lowest had sent a larger propor- 
tion of its poorer members to high school than the school grading 
highest. Nevertheless, when all the algebra marks of the mem- 
bers from the three schools were ranked together, it was found 
that "a greater percentage of School B excelled in maintaining 
their original rank or increasing it. In fact there was a com- 
plete reversal of things from what the absolute marks alone 
might indicate." 

In Iowa City, Iowa, Walter R. Miles 2 made a similar study of 
the marks of pupils entering the high school from the elementary 
schools of that city. Using the cases of all pupils whose scholastic 

1 Ralph E. Carter, Correlation of Elementary Schools and High Schools, 
Elementary School Teacher, 12:109-118. 

2 Walter R. Miles, Comparison of Elementary and High School Grades, 
Univ. of Iowa, Studies in Education, Vol. I, No. 1. 

records were complete for the last four years of their elementary
school and at least two years of their high school careers, he cov- 
ered a period of twelve years, and obtained 106 cases. To ob- 
tain a pupil's rank, all of his elementary school marks were av- 
eraged for his elementary school rank and all of his high school 
marks were averaged for his high school rank. By this means 
the inequalities of rating pointed out by Carter in Milwaukee 
were largely balanced from year to year and subject to subject 
since, in every case, the rank of a pupil represented the combined 
judgment of several teachers. The average of the elementary 
school marks thus determined was found to be 89.15 while the 
average of the high school marks was 82.49. By subjects, the 
averages varied as follows: In elementary school, from 87 to 
91.33; in high school, from 79.94 to 86.92. 

A list of coefficients is given representing the correlation be- 
tween the marks given in one department or school and another. 
The average of the fifteen coefficients of correlation between one 
elementary school subject and another is .567; the average of the 
ten between one high school subject and another is .618; the 
average of the eighteen between elementary school subjects and 
high school subjects is .446. The highest coefficient of the list 
is, naturally, that between the average of all elementary school 
grades, and the average of all high school grades. It is .71. 
When we remember that the marks used in these calculations 
were the average of several teachers' ratings in every case, the 
coefficients do not seem very high. It appears that the greater 
the number of marks which enter into the averages, the higher 
the correlation. We shall consider this point a little later. 

There are three other studies, the first by W. F. Dearborn 1 at 
Madison, Wis., the second by H. I. Miller 2 at Kansas City, Kan., 
and the third by F. W. Johnson 3 at Chicago, which point to the 
same absence of standards among teachers in elementary schools. 
The data which they furnish, however, may be accounted for 
in large part by other factors than variations in standards among 

1 W. F. Dearborn, School and University Grades, Univ. of Wisconsin
Bulletin, No. 368, 1910.

2 H. I. Miller, A Comparative Study of Grades of Pupils from Different Ele-
mentary Schools in Subjects of the First-Year High School, Elementary School
Teacher, 11:161-175.

3 F. W. Johnson, A Comparative Study of Grades of Pupils from Different
Elementary Schools, in Subjects of the First-Year High School, Elementary
School Teacher, 11:63-68.
teachers, and so it seems best to omit here any detailed statement
regarding them. 

To supplement these rather meager data pointing to the un- 
reliability of the marks given by elementary teachers, I made 
in December, 1913, a study of the cumulative record cards for 
the Hackensack, N. J., schools. Several considerations prompted 
me to use these schools for this investigation. Besides the effec- 
tive way in which the records are kept, and the cordial spirit with 
which Superintendent Stark and his corps of teachers welcome an 
investigator, the fact that departmental teaching is done in the 
seventh and eighth grades seemed very significant for my pur-
poses. There are four ward schools which send their pupils at
the completion of the sixth grade to this common seventh grade. 
It seemed to me important to determine how far the pupils from 
each school maintained their relative positions in the common 
seventh grade classes. If there should be found a difference in 
the amount of increase or decrease in marks from sixth to seventh 
grade among the four school groups as wholes, it would be pos- 
sible to measure with some degree of security the difference in 
standard between a given mark in one sixth grade and the same 
mark in another sixth grade. 

One fact must be taken into account in estimating the worth 
of such a measure. The seventh grade pupils are classified in 
three "courses": academic, commercial, and manual arts. The 
work is not identical in these courses, and, in part, the subjects — 
language for example — are not taught by the same teacher in all 
the courses. Nor do the representatives from the several sixth 
grades distribute themselves similarly among the three courses. 
Hence if different standards are held by the different seventh 
grade teachers, it may affect the results in some degree. Since 
it is impossible to calculate the amount of this influence, how- 
ever, I have disregarded it in the figures, and have assumed that 
the seventh grade marks are a common standard by which to 
measure the variation among the four sixth grades. 

The marks recorded on the cards in Hackensack are letters 
as follows: "E," "G," "F," and "P," for excellent, good, fair, 
and poor, respectively. The plus or minus after each letter is 
used, thus making twelve steps from the poorest to the best. 
For purposes of this study I have called "P-," 1; "P," 2; "P+,"
3; "F-," 4; and so on to "E+," as 12. The smallest difference
recognized is one of these steps and must be carefully distin-
guished from those differences recognized when the common
basis of 100 is used. 
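
The conversion just described can be sketched in a few lines of Python. This is only an illustration of the twelve-step scale as stated in the text; the names `LETTERS`, `SIGNS`, and `mark_to_step` are hypothetical and do not come from the study.

```python
# Illustrative sketch: convert a Hackensack letter mark to the twelve-step
# scale described in the text ("P-" = 1, "P" = 2, "P+" = 3, "F-" = 4, ... "E+" = 12).
LETTERS = ["P", "F", "G", "E"]        # poorest to best
SIGNS = {"-": 0, "": 1, "+": 2}       # position of the mark within its letter


def mark_to_step(mark: str) -> int:
    """Return the step (1 to 12) for a mark such as 'G+' or 'F'."""
    letter, sign = mark[:1], mark[1:]
    if letter not in LETTERS or sign not in SIGNS:
        raise ValueError(f"unrecognized mark: {mark!r}")
    return 3 * LETTERS.index(letter) + SIGNS[sign] + 1


print(mark_to_step("P-"), mark_to_step("F-"), mark_to_step("E+"))  # 1 4 12
```

On this scale a difference of one step is one third of a letter, which is why the step differences reported below should not be read as points on the 100-point basis.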

The data were gathered for all the pupils who made the two 
promotions in succession as follows: first group, from 6A in 
June, 1912, and from 7B in January, 1913; second group from
6A in January, 1913, and from 7B in June, 1913. Only the term's 
grade appears on the card, one mark for each subject; hence we 
have a composite mark in each subject derived from whatever 
sources, daily recitations, tests, etc., the teachers thought 
fit to determine the pupil's standing at promotion time. The 
marks for the six subjects, language, penmanship, history, 
geography, arithmetic, and spelling were used. 

The following simple plan was used for arranging the data: 

SCHOOL A

Pupils'          Language                  Penmanship
Names       6A    7B    Gain  Loss    6A    7B    Gain  Loss
Adams       9     7           2       5     6     1
            4     6     2             7     7


From the tabulations thus made the average gain or loss of 
the pupils from each of the four schools was determined by sim- 
ply dividing the algebraic sum of the gains and losses by the 
number of the pupils from the school. The median grades or 
marks given in both the sixth and seventh grades were also de- 
termined. These two groups of data appear in the following 
tables, numbered 1, 2, 3, and 4. 
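
The calculation described above (the algebraic sum of gains and losses divided by the number of pupils) can be sketched as follows. The function name `average_change` is hypothetical, and the two pairs of marks are simply the language entries shown in the sample form for School A; they are used only to illustrate the arithmetic, not to reproduce any figure in the tables.

```python
# Illustrative sketch of the average gain or loss for one school and subject.
# Each pair is (6A mark, 7B mark) on the twelve-step scale.
from statistics import median


def average_change(pairs):
    """Algebraic sum of gains (+) and losses (-) divided by the number of pupils."""
    changes = [later - earlier for earlier, later in pairs]
    return sum(changes) / len(changes)


# Language marks from the two rows of the sample form: 9 -> 7 and 4 -> 6.
language = [(9, 7), (4, 6)]

print(average_change(language))            # a loss of 2 and a gain of 2 cancel: 0.0
print(median(m6 for m6, _ in language))    # median 6A mark of the two rows: 6.5
```

A negative result corresponds to an "L" entry in the tables of average gains and losses, a positive result to a "G" entry.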

TABLE 1
The Median Marks Received by the Pupils Promoted from the Four
Sixth A Grades in June, 1912, and the Median Marks Received by
the Same Four Groups of Children in January, 1913, When They
Were Promoted from the Seventh B Grade

                                                                                        Medians of  Gain or
        No. of  Lang.       Penman.     Hist.       Geog.       Arith.      Spell.      Totals      Loss in
School  Pupils  6A    7B    6A    7B    6A    7B    6A    7B    6A    7B    6A    7B    6A    7B    Totals
A       19      8     5     7     5     9     5     8     8     6     5     8     5     45    33    -12
B       29      9     8     8     8     8     7     7     8     8     6     9     10    49    45    -4
C       20      8     5     8     5     9     5     8     8     8     5     9     8     49    35.5  -13.5
D       20      6.5   8     7     7     7.5   6.5   8     8     8     6     9.5   11    43    42    -1

TABLE 2

Average Gains or Losses of Pupils, by Schools, in the Various Sub-
jects Between the Marks Given in June, 1912, and to the Same
Children in January, 1913. G Stands for Gain, L for Loss

        No. of                                                              Average of Total
School  Pupils  Lang.     Penman.   Hist.     Geog.     Arith.    Spell.    Gains and Losses
A       19      L 2.32    L 3.06    L 2.73    L .10     L 1.89    L 1.09    L 1.86
B       29      L .86     L .88     L 1.39    L .04     L 1.48    G .79     L .64
C       20      L 2.19    L 3.93    L 4.16    L 1.40    L 1.30    L .70     L 2.21
D       20      G 1.10    L .26     L .80     L .09     L 1.30    G .39     L .16



TABLE 3

The Median Marks Received by the Pupils Promoted from the Four
Sixth A Grades in January, 1913, and the Median Marks Received
by the Same Four Groups of Children in June, 1913, When They
Were Promoted from the Seventh B Grade

                                                                                        Medians of  Gain or
        No. of  Lang.       Penman.     Hist.       Geog.       Arith.      Spell.      Totals      Loss in
School  Pupils  6A    7B    6A    7B    6A    7B    6A    7B    6A    7B    6A    7B    6A    7B    Totals
A       21      7     7     6     5     7     7     7     6     6     6     9     9     45    38    -7
B       17      6     6     none        6     6     6     7     5     9     9     10    32    32    0
C       6       7     5.5   7.5   5     8     6     8     4     6.5   5.5   8     6.5   42    31.5  -10.5
D       20      8     8     6     1     7     8     7     7     6.5   6     10    11    47    44.5  -2.5











TABLE 4

Average Gains or Losses of Pupils, by Schools, in the Various Sub-
jects Between the Marks Given in January, 1913, and to the Same
Children in June, 1913. G Stands for Gain, L for Loss

        No. of                                                              Average of Total
School  Pupils  Lang.     Penman.   Hist.     Geog.     Arith.    Spell.    Gains and Losses
A       21      L .57     L 2.09    L .38     L .86     L .48     G .33     L .68
B       17      L .06     None      G .29     G .06     G .68     G .46     G .24
C       6       L 1.16    L 2.00    L 2.00    L 2.33    L 1.16    L .83     L 1.58
D       20      .00       L .80     G .90     L .25     L .30     G .35     L .08



From the foregoing tables it is evident that there is no uni- 
formity among the standards used by the several teachers in 
giving marks to pupils. On the whole, the marks are consider- 
ably reduced from 6A to 7B. With few exceptions, the median 
marks given by the teacher of 6A in School C are higher than 
the median marks given by the teacher of 6A in School D, for
example, although when these two groups of children are marked 
at the close of the following semester in 7B, it is found that the 
pupils from School D are almost uniformly higher than those 
of School C. Considering the average gains and losses in some 
of the more extreme cases, we find many striking variations. 
In the June, 1912-January, 1913 group, the difference in stand- 
ards as represented by the common 7B marks the succeeding 
semester, amounts to 3.25 steps in language between schools C 
and D, 3.27 steps in penmanship between the same schools, 3.35 
steps in history between the same schools, and an average dif-
ference of 2.05 steps between the same schools. This means
that for work which the teacher in School C would give a mark
of "G" in language, penmanship, or history, the teacher in
School D would give less than a mark of "F," and that a pupil
whose monthly report card in School C had been a "G" card,
on the whole, would be dismayed by receiving an "F+" when he
moved to School D. It appears that School A marked somewhat
lower than School C while School B marked higher than School 
D. 

A nearer approach to uniformity seems to prevail in the 
January, 1913-June, 1913 group, although wide variations appear
there also. This greater uniformity may be partly accounted 
for by the fact that between June, 1912, and January, 1913, 
three of the sixth grade teachers who had given the marks re- 
corded in the first group were transferred and their places filled 
by teachers who had formerly been grammar grade teachers. 



STANDARDS OF MARKING IN HIGH SCHOOLS 

The first important study of this subject was made by F. W. 
Johnson, 1 principal of the University High School of the Univer- 
sity of Chicago. He investigated the marks given by the various 
departments in his school for the years 1907-08, and 1908-09 to 
determine the variation among them. His data are deserving 
of careful study. The plan of marking used in the University 
High School is as follows : F for failure, and D, C, B, A, for the 
successive ranks above failure. The percentages of the dif- 
ferent letters given by the several departments for 1908-09 are 
given in Table 5. 

TABLE 5

Giving the Distributions of the Marks of the Several Departments
of the University of Chicago High School
(From Johnson)

                    Total No.
Department          of Marks   % of F   % of D   % of C   % of B   % of A
Greek and Latin       886      10.6     16.1     31.8     23.5     17.9
German                416      8.4      19.5     26.4     28.6     17.1
French                475      10.9     18.7     33.0     28.0     9.3
English               1514     15.5     21.7     32.8     23.4     6.5
Mathematics           1466     14.5     25.2     27.6     21.1     11.5
History               825      8.1      15.9     31.2     30.0     14.7
Science               672      8.3      16.8     27.7     32.6     14.6
Domestic Science      176      5.7      2.3      27.3     51.7     13.1
Average               7297     11.5     18.9     30.6     27.0     12.0

One cannot fail to notice from the table that the failures in 
English and mathematics far outnumber the failures in either 
history, science or German, while the A's are nearly three times 
as frequent in Greek and Latin as in English. 

When the marks of individual teachers are separated from 
these department groups, still wider variation appears. For 
example, the marks given by two different teachers in the same 
department are as follows: 

% of F % of D % of C % of B % of A 

First Teacher 8.0 16.0 47.5 22.0 7.5 

Second Teacher 4.5 6.0 24.0 30.5 36.0 

1 F. W. Johnson, A Study of High School Grades, School Review, 19: 13-24. 

The comparison of the marks of two teachers in different
departments reveals even more striking variations:

% of F % of D % of C % of B % of A

First Teacher 26.5 42.5 25.5 4.5 1.5 

Second Teacher 4.5 6.0 24.0 30.5 36.0 

It is conceivable that a set of conditions might prevail in which 
the above variations would be justified, at least in part. Johnson 
offers them, however, as examples of variation which have no 
justification. They are simply due to different standards held 
by different teachers. 

Franklin O. Smith 1 at the University of Iowa used a different
method for discovering the variability of standards of marking 
in use in the high schools of Iowa. He compared the high 
school marks and the college marks of 120 Liberal Arts students 
who graduated from the University of Iowa in 1910. The 
average of all high school marks was used as the student's high 
school standing, and the average of all university marks as his 
university standing. The correlation between the high school 
and university standings of these 120 students is represented by 
a Pearson coefficient of .53. This seems surprisingly low in 
view of Smith's use of the average. Much lower correlation ap- 
pears, however, when the separate subjects are compared with 
one another, or even with the same subject in the two schools. 
If the marks of individual teachers are fairly reliable, we should 
expect to find the correlations rather high between, say, math- 
ematics in high school and mathematics in university. The 
following portion of his list of coefficients is illuminating: 

English, high school and university             .34
Mathematics, high school and university         .29
History, high school and university             .18
Ancient Language, high school and university    .43
Modern Language, high school and university     .28
Science, high school and university             .34
Average                                         .31
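
For readers who wish to check figures of this kind, the Pearson coefficient can be computed as in the sketch below. This is a plain textbook formulation, not a reconstruction of Smith's own procedure; the function name `pearson` is hypothetical. The last lines verify only that the six subject coefficients listed above average to .31.

```python
# Illustrative sketch of the Pearson product-moment coefficient of correlation
# between two sets of marks for the same students.
from math import sqrt


def pearson(xs, ys):
    """Pearson r for two equal-length sequences of marks."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one set of marks is constant")
    return sxy / sqrt(sxx * syy)


# The six subject coefficients quoted in the list above.
by_subject = [0.34, 0.29, 0.18, 0.43, 0.28, 0.34]
print(round(sum(by_subject) / len(by_subject), 2))  # 0.31
```

A coefficient of 1 would mean that every student kept exactly the same relative standing in both schools; a coefficient of 0 would mean no relation at all.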

Pettit 2 also found the Pearson coefficient of correlation
between average high school marks and average freshman col-
lege marks to be .63, but found the average of the coefficients

1 Franklin O. Smith, A Rational Basis for Determining Fitness for College 
Entrance, Univ. of Iowa, Studies in Education, Vol. 1, No. 3.

2 W. W. Pettit, A Comparative Study of New York High School and Colum- 
bia College Grades, Master's essay, Teachers College, 1912. 
when calculated by departments, high school English with col-
lege English, mathematics with mathematics, etc., to be .49.

It must be borne in mind that even these rather low coeffi- 
cients are derived from marks which are for the most part av- 
erages from several teachers' ratings. Teachers' marks in a 
single high school subject with those in the university would 
probably show even less relation. 

This plan of using the rank of a student in the next higher 
school as a guide for determining the correctness of rating in the 
lower school possesses such great possibilities for forcing us to
derive standards that it seems worth while to give two of Smith's
tables indicating quintile changes and retentions from school to
school. The averages are used to determine rank in either high
school or university, and then each fifth of the high school group
is traced through the university, thus giving a simple indication 
of how consistently a given rank is maintained. These two 
tables follow as Table 6 and Table 7. 

TABLE 6

Distribution by Quintiles in the University Rankings of Each Quin-
tile of the High School Rankings. General Averages of Marks
Used in Each School Determine Rank

High School       Distribution in University by Per Cents
                  1st Q.   2nd Q.   3rd Q.   4th Q.   5th Q.
1st Quintile      54.0     16.6     16.6     4.0      8.0
2nd Quintile      25.0     29.0     16.6     12.5     16.6
3rd Quintile      16.6     25.0     21.0     21.0     16.6
4th Quintile      0.       25.0     25.0     33.3     16.6
5th Quintile      4.0      4.0      21.0     29.0     42.0

This table (from F. O. Smith's study, page 142) reads as follows: Of the
lowest one fifth in high school rank, 54 per cent are found in the lowest one
fifth in college rank; 16.6 per cent are found in the next fifth, and so on.

TABLE 7

Same as Table 6 except that instead of the general average of all univer-
sity grades, the senior grades alone are used. (F. O. Smith, p. 145.)

High School       Distribution in University by Per Cents
                  1st Q.   2nd Q.   3rd Q.   4th Q.   5th Q.
1st Quintile      25.0     25.0     21.0     12.5     16.6
2nd Quintile      29.0     25.0     25.0     16.6     4.0
3rd Quintile      21.0     12.5     16.6     33.3     16.6
4th Quintile      21.0     16.6     16.6     25.0     21.0
5th Quintile      4.0      16.6     21.0     16.6     42.0

If there were no tendency for students to maintain their 
previous rank when they went on to a higher school, all the per 
cents in the tables reproduced would be just 20. Whether the
amount of the tendency indicated is sufficient to satisfy us re- 
garding the reliability of teachers' marks, each reader must 
judge. The average of the ten figures representing the retention 
in the same quintile is 31.3. A chance distribution accounts 
for 64% of the retention of quintile rank. 
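
The two figures just given can be checked directly from the diagonal entries of Tables 6 and 7, that is, the percentages of each high school quintile found in the same quintile at the university. The short sketch below does only that arithmetic; the variable names are hypothetical.

```python
# Illustrative check of the retention figures quoted in the text.
# Diagonal entries (same quintile in high school and university), in per cents.
table6_diagonal = [54.0, 29.0, 21.0, 33.3, 42.0]
table7_diagonal = [25.0, 25.0, 16.6, 25.0, 42.0]

retained = table6_diagonal + table7_diagonal
mean_retention = sum(retained) / len(retained)

# With five quintiles and no tendency to keep rank, each cell would hold 20 per cent.
chance = 100 / 5
share_due_to_chance = chance / mean_retention

print(round(mean_retention, 1))           # 31.3
print(round(100 * share_due_to_chance))   # 64
```

In other words, of the 31.3 per cent who keep their quintile, 20 would be expected to do so by chance alone, and 20 is about 64 per cent of 31.3.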

From these tables as well as from the coefficients of correla- 
tion (they are lower than those found by Dearborn, Miles, or 
Pettit) a fair inference can be made that the fact of absence of 
standards is a very large factor in producing this change of rank 
from one school to the next. Smith is using representatives 
from a large number of small schools instead of a small number 
of large schools. Perhaps one or two students from a school is 
the rule rather than fifty or more, and there is less chance for 
these isolated schools and teachers to approach uniformity of 
standards than there is in the cases of the large schools. Smith 
indicates in the last sentence of his study his appreciation of the 
need for standardization: "But when this is done (meaning the 
adoption of a rational method of the distribution of marks), 
there still remains the problem of standardizing the teacher's 
judgment." 

One of the most striking illustrations of how largely a matter of 
tradition the passing standard is, is afforded by the figures in 
Table 8, which were given to me by the principal of one
of the New York City high schools. The difference between the 
percentage of pupils allowed to pass in the various subjects dur- 
ing the year previous to his becoming principal and the first 
year of his service was due almost wholly to the determination 
on his part to break up the tradition that a large percentage of 
each class ought to fail. 

Consider that in the large high school where this change took 
place this meant a reduction of the number of failures by nearly 
if not quite 500 a year. All this depended primarily upon the 
notion of one man. In the same high school during the last 
three years the average time of attendance of students to win 
graduation has been reduced by more than one year. There are 
undoubtedly many other factors entering into the remarkable 
changes, but the largest factor, certainly, is that the present 
principal and the former principal happen to have radically 
different standards. 




TABLE 8
Representing the Change in Percentage of Pupils Passed in the De-
partments of a New York High School from One Year to the
Next

                      Percentage of Pupils Passed
Department   Term     1910     1911
Biology      1st      69       81
             2nd      79       83
Algebra      1st      48       75
             2nd      61       80
English      1st      77       86
             2nd      84       90
French       1st      68       83
             2nd      66       85
German       1st      65       83
             2nd      65       86
Latin        1st      64       78
             2nd      72       82
Average               68.2     82.7
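
The averages in the last row of Table 8 are simple means of the twelve department-term percentages, as the following sketch confirms. It is a check on the arithmetic only; the dictionary name `passed` is hypothetical, and the figures are those printed in the table.

```python
# Illustrative check of the averages in Table 8.
# Keys are (department, term); values are (per cent passed 1910, per cent passed 1911).
passed = {
    ("Biology", "1st"): (69, 81), ("Biology", "2nd"): (79, 83),
    ("Algebra", "1st"): (48, 75), ("Algebra", "2nd"): (61, 80),
    ("English", "1st"): (77, 86), ("English", "2nd"): (84, 90),
    ("French", "1st"): (68, 83),  ("French", "2nd"): (66, 85),
    ("German", "1st"): (65, 83),  ("German", "2nd"): (65, 86),
    ("Latin", "1st"): (64, 78),   ("Latin", "2nd"): (72, 82),
}

average_1910 = sum(before for before, _ in passed.values()) / len(passed)
average_1911 = sum(after for _, after in passed.values()) / len(passed)

print(round(average_1910, 1), round(average_1911, 1))  # 68.2 82.7
print(round(average_1911 - average_1910, 1))           # 14.5
```

The average proportion passed thus rose by about fourteen and a half points in a single year, with the largest single change in first-term algebra.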

In an unpublished report of a study made in 1912 by Carter 
H. Alexander, at the time Professor of School Administration 
at the University of Missouri, some interesting facts concern- 
ing standards in the high schools of Missouri are brought out. 
The amount of variability among the standards employed in 
the thirty-one schools whose records were studied is best shown 
in the following table which I copy from the report: M stands
for median, 1Q for 25 percentile or that point below which 25
per cent of the cases fall, 3Q for 75 percentile.

TABLE 9
Percentages by Schools of All Grades Issued by Each School in the
Various Subjects Which Are Below Passing. The Medians and
Limits of the Middle 50 Per Cent of the Distributions for 31
Missouri High Schools, 1911-12, Accredited to the University
(Alexander)

                   1st Year                 2nd Year
               M      1Q     3Q         M      1Q     3Q
English        8.7    4.3    17.6       4.0           12.5
History        12.4   8.0    23.5       10.0          16.9
Mathematics    13.2   6.5    22.0       12.9          23.9
Latin          17.8   8.4    34.5       11.8   4.6    18.1
German         16.0   11.0   20.0       10.0   2.1    11.4

[The 3rd Year and 4th Year columns (M, 1Q, 3Q) are too badly damaged in
this copy to be realigned. The surviving entries, in reading order, are
2.8, 5.5, 11.0, 11.0, 20.0, 16.3, 8.1, 5.0, 2.4.]


This table reads as follows: In first year English, half the schools give 
more than 8.7 per cent of grades below passing, and half give less; one fourth 
of the schools give less than 4.5 per cent of grades below passing, and one 
fourth give more than 17.6 per cent below passing. 
Every item in this table is significant. Surely no one can
maintain that such wide variation regarding the number not 
passed in the several schools corresponds to a similar variation 
in student merit from school to school. Look, for example, at 
the third year column. One fourth of the schools fail none of 
their students in English, history, mathematics, or Latin, while 
another fourth of the schools fail more than 8 to 20 per cent of 
their students in the same branches. In spite of such differences, 
Dearborn 1 argues from a close correlation of rankings between 
averages in high school and averages in the university, that the 
plan of accrediting high schools forms a successful way of select- 
ing students. 

Mention may be made here of the variation among the aver- 
ages of marks given by the several departments, and in the 
percentages failed by departments in a representative high 
school for which figures are available. In Iowa City, Miles
reports, in the study referred to above, the following averages by
departments, and failure marks:

                     Average of Marks    Per Cent Failed
Science              79.94               13
Foreign Language     81.53               14
Mathematics          80.51               19
English              83.39               9
History              84.20               7
Drawing              86.92               Not given

In a certain large Illinois high school the principal reports
percentages of failures as follows:

Commercial           28
Mathematics          23
Modern Languages     22
Ancient Languages    18
History              16.5
English              16
Science              13.5

Notice that twice as many are failed in commercial subjects 
as in science. The enrollment of the two departments concerned 
is 850 and 620 respectively, so we see that about ninety more 
pupils each year fail in commercial subjects than in science. 

In the two studies referred to above by W. F. Dearborn, at 
that time a member of the faculty of the University of Wisconsin,

1 W. F. Dearborn, Relative Standing of Pupils in the High School and in the
University, Univ. of Wisconsin Bulletin, No. 312.
we find not only a mine of information on the subject of
grading but we find also the source of inspiration for three other 
most painstaking investigations in the same field. These three
are a Master's essay written at Teachers College in 1911 by W.
W. Pettit, 1 entitled "A Comparative Study of New York High
School and Columbia College Grades"; a Doctor's dissertation
written at the University of Chicago in 1912, by John A.
Clement, 2 entitled "Standardization of the Schools of Kansas";
and the third a recent number of the Educational Psychology
Monographs, prepared at the University of Chicago by Clar- 
ence Truman Gray 3 and entitled, "Variations in the Grades of 
High School Pupils." The latter two studies were written under 
the direction of Dearborn at Chicago, but differ from Dearborn's 
study in one essential particular which will be described later. 
Pettit's study follows the same plan as Dearborn's and his conclu- 
sions are supposed to support the conclusions reached by Dearborn, 
whose chief purpose was to establish the superior merit of the 
plan of admission to college by accreditment over the plan of 
admission by examination. As supporting this purpose, the 
method of Dearborn contains two fallacies, it seems to me, which 
should be pointed out. I shall, therefore, defer consideration 
of the material in the later studies until after a criticism 
of the method of Dearborn and Pettit. While the latter author 
did not have the same purpose in view he used the same method 
as Dearborn to determine the "relative standing of pupils in
the high school and college," and one of the chief points of 
significance which attaches to such information is its bearing upon 
the question of method of admission to college. Pettit cannot 
be held guilty, however, of the fallacies which are present in the 
Dearborn method. These two fallacies are, first, the use of 
averages to determine rank in both the high school and the 
university, when the results are to be applied to the question of 
admission to college by accreditment, and second, the failure 
to take account of differences in standards of rating by schools 
among the group of schools studied. 

1 Unpublished.
2 University of Chicago Press.
3 Warwick and York, Baltimore, Md.

The pointing out of these two fallacies is not a mere academic
matter. If facts establish the conclusion that accreditment is
"a successful means of selecting students for college," the
corollary must follow that the standards of marking among the 
teachers in the high schools concerned are satisfactorily uniform, 
and there is not the urgent need for standardization in marking 
which is being claimed at present. Dearborn closes his study 
with the comment that his results "are in sharp contrast to those 
secured by the test of the entrance examinations at Columbia 
College." (Professor Thorndike's study is meant.) While I 
do not regard entrance examinations at all satisfactory as a means 
of selecting students for college, I do believe it can be shown that 
the fallacies above referred to are responsible for the "sharp 
contrast" which Dearborn establishes in favor of accreditment. 

In Dearborn's study, "The Relative Standing of Pupils in 
the High School and the University," the high school marks of 
all the representatives from the six cities which furnished the 
largest number of students in the College of Letters and Science 
at the University of Wisconsin from 1900 to 1905 were secured, 
as well as all the marks received by these same students in all 
their undergraduate classes in the University. This made a 
group of 472 students in all. Only three subgroups were con- 
sidered: Madison High School furnished 238; the three high 
schools of Milwaukee together furnished 139; four smaller 
high schools in the state furnished the remainder, 92. Because 
no closer differentiation into single high school groups is made, 
the second fallacy indicated above, namely, the disregard for 
difference in standards by schools, is not so clearly evident 
although demonstrable. On this account I shall use Pettit's 
data for the first and more complete illustration. 

Pettit studied the ratings of all the individuals who entered 
Columbia College from 1900 to 1910 from three high schools 
which we shall call A, B, and C. All the high school marks re- 
ceived in English, history, mathematics, science, Greek, Latin, 
and modern languages by each boy were averaged together to 
make his rank in the total high school group. Similarly the col- 
lege ranks were determined by averaging all marks received in 
a similar group of departments in college. Of the total group 
of 218 boys, School A furnished 53, School B, 88, and School C, 
77. 

Pettit followed the method used by Dearborn except in one
particular. Where Dearborn used the quartile division of the
group, that is, divided the whole range of ranks into four equal
parts, Pettit used the quintile division, dividing the whole group
into five equal parts. The individuals in each quartile or quintile
were then traced through the higher school by quartile or
quintile. To illustrate I here reproduce one of Pettit's tables
indicating the location in the sophomore rankings of the members
of each quintile in the high school group of 218 boys:

High School              College Sophomore Quintiles
Quintiles           I       II      III     IV      V
    I              60%     30%      7%      3%      0%
    II             23      20      23      17      17
    III             6      19      31      22      22
    IV                     17      11      37      34
    V                      10      31      24      34

This table reads as follows: of the fifth ranking highest in high school, 60 
per cent are found in the highest fifth in college sophomore class; 30 per 
cent are found in the next to the highest fifth in rank, and so on. 

The two criticisms which I wish to make of the two studies 
are, then, specifically these : first, that data such as these give but 
slight ground for the conclusion that the system of accredited 
schools in vogue in the West is a successful method of selecting 
students for college, and second, that the amount of quartile and 
quintile change from high school to college, which is due directly 
to different standards prevailing in the several high schools com- 
posing the group, is far from negligible. 

In support of my first contention I hold that unless admission 
to college by accreditment means that all the high school grades 
of applicants for admission to college are averaged and admis- 
sion granted on the basis of this average, then the data submitted 
do not bear directly upon the success of the scheme of accredit- 
ment. So far as I am aware, no college secures its students that 
way. Instead, the common way is for the high school to satisfy 
the college that its work is up to the standard. In return the 
high school gets the assurance that any student whom it graduates 
will be admitted to college provided he has been passed by the 
school in a certain list of subjects prescribed by the college. The 
passing in these subjects is determined, as a rule, by the stand- 
ard of a single teacher in each subject. The variability of these 
standards from subject to subject, and school to school, has been 
abundantly shown, especially in Iowa and Missouri. In order 
successfully to maintain that a close correlation between high 
school and college rank is a fair indication of the reliability of a 
teacher's mark in a particular subject, it would have to be shown 
that the ranking of a group from several schools in a single sub- 
ject, say, physics, correlates closely with the rankings of the same 
group in a similar subject in the university. This, neither Dear- 
born nor Pettit has shown. In fact there is every reason to 
believe from the other studies in this field that the nearer we ap- 
proach the marks of individual teachers the lower will be found 
the correlations between the rank in one group with the rank in 
the next. It will be recalled, for example, how much lower was 
the correlation between the marks of a department in high school 
with the marks of the same department in the college found by 
Smith in Iowa, than was the correlation between the average of 
all high school grades with the average of all college grades. This 
is very significant for our purposes. We should expect the aver- 
age of the estimates of a dozen or more teachers to come pretty 
close to the correct ranking of young people. We should expect 
the average estimates of a dozen teachers in the higher school to 
come pretty close to the same ranking. Nearly everything of 
importance about marks, however, attaches to a particular 
teacher's mark in a particular subject. Even admission to col- 
lege by the accreditment plan depends finally upon it. 

To indicate clearly how children's averages from one term to 
the next correlate more closely than do the marks which enter 
into the averages, I calculated Pearson coefficients to designate 
the correlation between one teacher's marks in a given subject 
and the succeeding teacher's marks in the same subject in the 
case of forty-two separate pairs of classes. I then averaged the 
marks given the same child in his six different subjects, and cor- 
related these averages in the case of the seven different groups 
of children who constituted the forty-two separate pairs of 
classes. 

The data gathered at Hackensack for the study of the varia- 
tion of teachers' marks in elementary schools were conveniently 
arranged for calculating the above coefficients. While it would 
have been desirable to use high school marks for this purpose, 
the same principle should hold throughout. 

If the teacher's mark is a reliable index of the child's ability,
then the marks of two successive teachers in the same subject
should show a close correlation, closer in fact than the average
of many marks extending over a long time and including a variety
of subjects. On the other hand, even if the teacher's mark is a 
very poor index of the child's ability, the average of thirty or 
forty such estimates is bound to approach fairly near that 
"general ability" which will be approached again by thirty or 
forty more estimates made in the higher school. Thus the rank 
obtained by the method of averages will be likely to hold fairly 
consistently from school to school, regardless of how inconsis- 
tently the teachers may mark from term to term. It is the teacher's 
mark which determines passing or failing, and it is passing or 
failing which determines college entrance by the accreditment 
plan. 
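The argument of the preceding paragraph can be illustrated by a small simulation. The sketch below is not drawn from the Hackensack data; it assumes, purely for illustration, that each mark equals a pupil's "general ability" plus a large random error peculiar to the teacher, and the names `simulate`, `n_marks`, and `noise_sd` are hypothetical. Under those assumptions a single mark in the lower school correlates only weakly with a single mark in the higher school, while the averages of thirty such marks correlate closely.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n_pupils=500, n_marks=30, noise_sd=2.0, seed=1913):
    """Return (r_single, r_avg) under the ASSUMED model mark = ability + error.

    noise_sd = 2.0 makes each mark a deliberately poor index of ability:
    the error is twice as variable as the ability itself.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n_pupils)]

    def marks_for_school():
        # n_marks independent, noisy estimates of every pupil's ability
        return [[a + rng.gauss(0, noise_sd) for _ in range(n_marks)]
                for a in ability]

    lower, higher = marks_for_school(), marks_for_school()

    # One teacher's mark in the lower school against one in the higher school
    r_single = pearson([m[0] for m in lower], [m[0] for m in higher])
    # Average of all marks in the lower school against the same in the higher
    r_avg = pearson([sum(m) / n_marks for m in lower],
                    [sum(m) / n_marks for m in higher])
    return r_single, r_avg


r_single, r_avg = simulate()
print(f"single marks: r = {r_single:.2f}; averages of 30 marks: r = {r_avg:.2f}")
```

The particular figures printed depend wholly on the assumed error size; only the direction of the difference bears on the argument.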

In Table 10, the correlations are indicated by Pearson
coefficients. Great accuracy cannot be claimed for any individual
figures indicating relationships where the number of
the cases in the distribution so correlated is so small, but the
Pearson coefficients seem as accurate as any figure. It becomes
significant, however, when a large number of such coefficients
are secured, and their averages used as a measure of the
correlation between one fact and another. It will be observed
that in every case, the average of the coefficients obtained from
the marks of the separate subjects is decidedly less than the
coefficient obtained from the averages of the marks of the same
pupils. If now these averages were averaged with the averages
from a half dozen other semesters, the coefficients would become
rather high, even though the individual marks seem to be but poor
indices of ability in the several subjects.

TABLE 10

Pearson Coefficients of Correlation Between the Marks Given in
6A, June, 1912, and the Marks Given to the Same Children in the
Same Subject in 7B, January, 1913, and the Same Below for the
January, 1913-June, 1913 Group

                                                             Avg. of     Coefficient    Gain by
                                                             the 6       of the Avgs.   Method
School      Lang.   Pen.    Hist.   Geog.   Arith.  Spell.   Coeffi-     of the 6       of Avg.
                                                             cients      Marks
A            .20     .05     .21     .52     .63     .39      .334        .48            .146
B            .51     .62     .59     .37     .20     .63      .488        .84            .352
C           -.10     .00     .13     .18     .03     .75      .166        .35            .184
D            .52     .11     .71     .162   -.16     .64      .39         .61            .22

A            .70    -.07     .69     .62     .64     .54      .519        .60            .081
B            .74     none    .48     .24     .39     .64      .506        .70            .194
D            .74     .54     .33    -.04     .51     .37      .409        .49            .081

Averages     .471    .211    .449    .364    .321    .566     .401        .581           .18

Note: All of the above coefficients are plus except where minus is indicated.

In the lower group, School C is omitted because the numbers in the class were too small to make valuable a
figure indicating relationship between successive marks.

From this table it is seen that the coefficient is increased .18 
by taking the average for a single semester, instead of individual 
teachers' marks by subjects. How much more would it be in- 
creased if the average for several years and many teachers were 
used? I submit then that the coefficients of correlation given 
by Smith, Miles, Dearborn and Pettit (.55, .71, .80, and .63, 
respectively) are poor evidence of the reliability of teachers' 
marks individually, and it is those marks, not averages, that 
count for accreditment. 
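The gain of .18 may be checked from the last three columns of Table 10. The following is a minimal sketch: the figures are those printed in the table, while the variable names are merely illustrative.

```python
# Columns "Avg. of the 6 Coefficients" and "Coefficient of the Avgs. of the
# 6 Marks" from Table 10, in printed order: A, B, C, D (upper group),
# then A, B, D (lower group).
avg_of_six_coefficients = [.334, .488, .166, .39, .519, .506, .409]
coefficient_of_averages = [.48, .84, .35, .61, .60, .70, .49]

# Gain by the method of averages, group by group
gains = [round(c - a, 3)
         for a, c in zip(avg_of_six_coefficients, coefficient_of_averages)]

# Mean gain over the seven groups
mean_gain = sum(gains) / len(gains)

print(gains)
print(round(mean_gain, 2))
```

Each entry of `gains` agrees with the printed "Gain by Method of Avg." column, and the mean rounds to the .18 cited in the text.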

My second contention is that a far from negligible part of the 
changes in rank from high school to college is due directly to 
different standards of marking in use in the several schools 
making up the group. On this point Dearborn says, "as the 
average of the marks of pupils entering from one high school 
was often 1 or 2 per cent higher than that of another high school 
it was the practice at first to weight the marks of all pupils to a 
common average of all the high schools included in the group. 
It was found, however, from actual trial, that such weighting 
did not affect the general comparison sufficiently to be worth 
while. In some cases at least the differences in averages of the 
high schools may represent real differences in the efficiency of the 
pupils concerned. But however that may be, the weighting of 
marks did not affect appreciably the large units of comparison 
employed in this study, and has not for that reason been used 
in the final results." 

Pettit makes no mention of the point. He ranks his indi- 
viduals on the supposition that "80" in one school means the 
same as "80" in the others. 

The method employed to determine just how much of the 
change in rank from high school to college was due to the error 
in the above supposition was as follows: I calculated on the 
basis of the numbers from each school just what proportion of 
the changes should be attributed to each school group. I then 
determined by count how many quintile losses and quintile 
gains were actually made by each school group. If the losses 
were found to be proportionately too high, I assumed that the 
standard of rating in the high school concerned had been lower 
than in the other schools. That is, an "80" had meant less in 
that school than in all the others. First, however, I compared 
the whole range of marks given each group in the high school 
and in the Freshman class in college. The data for this com- 
parison in the cases of the three high schools studied by Pettit 
are given below in Table 11: 

TABLE 11

Giving the Median Mark and Quintile Division Points in Distributions
of Marks by High School Groups

                  Quintile Division Points, High School Marks
              Median     1st       2nd       3rd       4th
School B       78        66.75     74.25     79.6      84.8
School A       80        72        77.5      80.75     84
School C       74        67        71.75     76        82.6

Giving the Median Mark and the Quintile Division Points in Distributions
of College Freshman Marks by High School Groups

                  Quintile Division Points, College Freshman Marks
              Median     1st       2nd       3rd       4th
School B       81        73        79.1      84.2      88.5
School A       75.5      71        74        78        83
School C       81        74.5      79        83        88

From the high school marks School A boys seem the strongest
students. When the marks of the first college year are taken, a
change appears. The median for School A drops 4.5 points while
the median for School B rises 3 points and the median for School
C rises 7 points. Similar reversal of the situation occurs all along
the distribution. This is represented in the diagram, Fig. 1.

Fig. 1. (Data from Pettit.) Division points between quintiles in the
distribution according to high school marks, and again in the distribution
according to college freshman marks for the same pupils.
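The shifts in the medians can be read off Table 11 by simple subtraction. A brief sketch, with illustrative variable names:

```python
# Median marks from Table 11 (high school marks and college freshman marks)
hs_median = {"A": 80, "B": 78, "C": 74}
college_median = {"A": 75.5, "B": 81, "C": 81}

# Change in each school's median from high school to freshman year
shift = {school: college_median[school] - hs_median[school]
         for school in hs_median}

strongest_in_high_school = max(hs_median, key=hs_median.get)
weakest_in_college = min(college_median, key=college_median.get)

print(shift)
print(strongest_in_high_school, weakest_in_college)
```

The school whose boys appear strongest by high school marks is the same school whose boys stand lowest by college marks.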

Consider now the fate of the lowest and highest fifths of each 
high school group. Table 12 is prepared to indicate clearly 
the change which appears in the representation in the highest 
and lowest quintiles, by schools, between high school and fresh- 
man college ranks. 

TABLE 12

Showing the Number from Each School Which Make up the Lowest
and Highest Fifths of the High School and Freshman College
Groups, and the Per Cent this Number is of the Number Which
Each School Would Have if Representation in these Quintiles
Were Proportionate to the Whole Number from that School

           Lowest Quintile     Lowest Quintile     Highest Quintile    Highest Quintile
           H. S. Group         College Group       H. S. Group         College Group
School     No.   % of quota    No.   % of quota    No.   % of quota    No.   % of quota
B          19      108         16       91         22      125         20      114
A           3       28         18      170         13      122          7       66
C          22      143         10       65          9       58         17      111

From Table 12 we notice that while School A furnishes only 
28 per cent of its quota to the lowest quintile, according to the 
ratings of high school teachers, it furnishes 170 per cent of its 
quota to the lowest quintile when the boys are rated in college. 
On the other hand, School C furnishes 143 per cent of its proportion
to the lowest quintile in the high school ranking, but only
65 per cent of its share in the college ranking. Similar reversals 
appear in the highest quintile individuals, except that the shift- 
ing is in the opposite direction. School B, on the other hand, 
seems to have used a standard more nearly like the college, and 
midway between School A and School C. 
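The "per cent of quota" figures of Table 12 follow from the size of each school group. A minimal sketch, assuming that a school's quota for any quintile is one fifth of its own number (53, 88, and 77 boys, as stated earlier); the function name is illustrative:

```python
# Number of boys from each school in Pettit's group of 218
school_size = {"A": 53, "B": 88, "C": 77}


def percent_of_quota(count, school):
    """Per cent that `count` is of one fifth of the school's own number."""
    quota = school_size[school] / 5
    return 100 * count / quota


# School A in the lowest quintile: 3 boys by high school rank, 18 by college rank
print(round(percent_of_quota(3, "A")), round(percent_of_quota(18, "A")))
# School C in the lowest quintile: 22 boys by high school rank, 10 by college rank
print(round(percent_of_quota(22, "C")), round(percent_of_quota(10, "C")))
```

A few of the printed percentages differ by a unit from what this calculation yields, presumably through rounding in the original tabulation.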

In view of these obvious differences in standards among the 
three high schools it seemed worth while to calculate accurately 
just what portion of the shifting of position in the ranks between 
high school and college was due to these differences in standards. 
In doing this it seemed wise to use a method in the calculations 
a little more exact than that used in the study. When a dis- 
tribution, say, of fifty marks, is divided into quintiles, the tenth 
mark needs to change but one rank in order to fall in the next 
quintile, and thus register as one quintile change. The first 
individual in the distribution, on the other hand, has to change 
by as much as ten ranks in order to register one quintile change. 
The method used here was devised to obviate that feature in 
counting quintile changes, because in using the expression, 
"dropped from the first to the second fifth of the class," we 
convey the idea of having shifted position by as much as one 
fifth of the number in the class. 

To explain the method most simply let us consider the cases 
of the fifty-three School A boys. If we record in the left hand 
column of the accompanying table the ranks of the boys in their 
own high school group, and in the second column their ranking 
when separated from the freshman college group, we may count 
the quintile gains or losses by subtracting each rank from the 
corresponding rank in the other series. If this difference equals 
one fifth of the total number of ranks in the series, it will register 
as one quintile change. If it equals two fifths of the number of 
ranks in the series, it will register as two quintiles change, etc. 
For example, in the table given herewith, from fourth to eight- 
eenth rank is a change of fourteen places and we register a loss of 
one quintile. From tenth to forty-ninth place is a drop of three
quintiles, etc.

TABLE 12A
To Illustrate Method of Computing Quintile Gains and Losses

H. S. Ranks    Freshman Ranks of Same Boys    Quintile Gains or Losses
     1                    3
     2                    5
     3                    2
     4                   18                            -1
     5                    8
     6                   19                            -1
     7                    7
     8                   14
     9                    9
    10                   49                            -3
    11                   17
    12                    6

    42                   3S
    43                   13                            +3
    44                   22                            +2
    45                   25                            +2
    46                   41
    47                   42
    48                   23                            +2
    49                   34                            +1
    50                   48
    51                   53
    52                   32                            +2
    53                   52
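The counting rule illustrated in Table 12A may be sketched as follows. The sketch assumes that "one fifth of the total number of ranks" is taken as the whole number 10 for the fifty-three boys, and that only completed fifths are counted; that reading is an inference, adopted because it reproduces the printed entries. The function name is illustrative, and the row whose freshman rank is illegible is left out.

```python
def quintile_change(hs_rank, college_rank, n):
    """Signed number of whole fifths moved; negative means a loss of rank."""
    unit = n // 5                      # ASSUMPTION: one fifth of 53 taken as 10 places
    places = hs_rank - college_rank    # positive when the boy moves toward rank 1
    steps = abs(places) // unit        # only completed fifths are counted
    return steps if places > 0 else -steps


# (high school rank, freshman rank) pairs legible in Table 12A
pairs = [(1, 3), (2, 5), (3, 2), (4, 18), (5, 8), (6, 19), (7, 7), (8, 14),
         (9, 9), (10, 49), (11, 17), (12, 6), (43, 13), (44, 22), (45, 25),
         (46, 41), (47, 42), (48, 23), (49, 34), (50, 48), (51, 53),
         (52, 32), (53, 52)]

changes = [quintile_change(hs, col, 53) for hs, col in pairs]
quintile_gains = sum(c for c in changes if c > 0)
quintile_losses = -sum(c for c in changes if c < 0)

print(changes)
print(quintile_gains, quintile_losses)
```

Thus the change from fourth to eighteenth rank registers as a loss of one quintile, and that from tenth to forty-ninth as a loss of three, as stated in the text.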
The same method exactly was followed when calculating the 
quintile gains and losses of the whole 218 group, with the addition 
of a mark attached to each number in the first column to indicate 
from what high school the boy came so that the quintile gains 
and losses could be credited to the proper school. 

By this method of calculation the following Table, No. 13, was 
constructed, showing the number and percentage of individuals 
from each school who maintained their quintile position, who 
gained rank by one-fifth, two-fifths, three-fifths, or four-fifths of 
the number in the group, and who lost rank by one-fifth, two- 
fifths, three-fifths, or four-fifths of the number in the group. 



TABLE 13

Showing the Numbers Retaining Same Rank, and Numbers Changing
Rank, from High School to Freshman College, in Entire Group
of 218 Boys (Compiled from Pettit)





Retain 

Same 
Quintile 
Position 


Gain 1 
Quintile 


Gain 2 


Gain 3 


Gain 4 


Lose 1 


Lose 2 


Lose 3 


Lose 4 


School B 

1st Quintile . . 
2nd Quintile . . 
3rd Quintile. . . 
4th Quintile. . . 
5th Quintile. . . 


19 
10 
11 
12 
10 


1 
2 

6 


1 
3 






3 
4 

2 


2 
2 






Total 


62 


9 


4 






9 


4 






Per cent 


71 


10 


4 






10 


i 






School A 

1st Quintile . . . 
2nd Quintile . . 
3rd Quintile. . . 
4th Quintile. . . 
5th Quintile.. . 


7 
8 
1 

10 
3 


1 
1 








3 
1 

7 


2 
4 
4 


1 




Total 


29 


2 








11 


10 


1 






55 


4 








21 


19 


2 




School C 

1st Quintile . . . 
2nd Quintile . . 
3rd Quintile. . . 
4th Quintile. . . 
5th Quintile. . . 


8 
12 
7 
9 
12 


1 
4 
6 
4 


1 
2 

4 


1 


1 


1 

3 
2 








Total 


48 


15 


7 


1 


1 


6 










62 


19 


9 


1 


1 


8 
If now we consider the gain of two quintile ranks by one boy
the equivalent of two quintile gains, and so on for gains and
losses of three, or four quintiles, we may summarize the gains
and losses by schools as follows:

Schools     Quintile Gains     Quintile Losses     Excess
B                17                  17
A                 2                  34            Loss 32
C                36                   6            Gain 30
Totals           55                  57            62

From this we see that of the total changes (55 plus 57), sixty- 
two or fifty-five per cent were due to the sliding up or down the 
scale of the group from a particular school in mass. Surely so 
much of the transfers should not be considered negligible. 
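The summary above can be recomputed from the "Total" rows of Table 13 by weighting each change by the number of quintiles moved. A minimal sketch; the dictionary layout and names are illustrative only:

```python
# "Total" rows of Table 13: number of boys gaining or losing 1, 2, 3, or 4 quintiles
counts = {
    "B": {"gain": {1: 9, 2: 4}, "loss": {1: 9, 2: 4}},
    "A": {"gain": {1: 2}, "loss": {1: 11, 2: 10, 3: 1}},
    "C": {"gain": {1: 15, 2: 7, 3: 1, 4: 1}, "loss": {1: 6}},
}


def weighted(by_size):
    """A change of k quintiles by one boy counts as k quintile changes."""
    return sum(size * boys for size, boys in by_size.items())


gains = {s: weighted(c["gain"]) for s, c in counts.items()}
losses = {s: weighted(c["loss"]) for s, c in counts.items()}

# Excess of gains over losses (or the reverse) for each school taken in mass
excess = {s: abs(gains[s] - losses[s]) for s in counts}

total_changes = sum(gains.values()) + sum(losses.values())
mass_shift = sum(excess.values())
share = mass_shift / total_changes

print(gains, losses, excess)
print(total_changes, mass_shift, round(100 * share))
```

The excess for the three schools together is sixty-two out of 112 changes, or about fifty-five per cent.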

The above fact can be verified from the less exact tables given 
by Pettit himself. By taking the difference between the num- 
bers from each school found in each quintile of the high school 
group and in the freshman group, we get a measure of the same 
fact. Table 14 gives those data: 

TABLE 14

Membership from Each High School in Each Quintile in High School
and Freshman Distributions

                                                                     Total
                        1st    2nd    3rd    4th    5th    Total     Differences
School B
High School Group       22     17     16     14     19      88
Freshman Group          20     21     15     16     16      88
Differences              2      4      1      2      3                12

School A
High School Group       13     14     12     11      3      53
Freshman Group           7      7      9     12     18      53
Differences              6      7      3      1     15                32

School C
High School Group        9     13     16     17     22      77
Freshman Group          17     16     20     14     10      77
Differences              8      3      4      3     12                30

Total Differences       16     14      8      6     30                74

According to Pettit's Chart 1, showing individual transfers 
from quintile to quintile, there were in all 126 quintile changes 
from high school ranks to freshman ranks. By the above table 
it appears that seventy-four of them, or fifty-eight per cent, can 
be accounted for by the different standards of rating in the three 
high schools. 
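The differences in Table 14 and the resulting share of Pettit's 126 quintile changes may be verified thus. A brief sketch with illustrative names; the counts are those of the table:

```python
# Quintile membership (1st to 5th) from Table 14
high_school = {"B": [22, 17, 16, 14, 19],
               "A": [13, 14, 12, 11, 3],
               "C": [9, 13, 16, 17, 22]}
freshman = {"B": [20, 21, 15, 16, 16],
            "A": [7, 7, 9, 12, 18],
            "C": [17, 16, 20, 14, 10]}

# Absolute change in each school's membership, quintile by quintile
differences = {s: [abs(h - f) for h, f in zip(high_school[s], freshman[s])]
               for s in high_school}
per_school = {s: sum(d) for s, d in differences.items()}
total_differences = sum(per_school.values())

# Share of the 126 quintile changes counted on Pettit's Chart 1
share = total_differences / 126

print(per_school, total_differences)
print(f"{100 * share:.1f} per cent")
```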

Now for Dearborn's claim that to weight the marks to a com- 
mon average of all the schools would not alter the results. I 
have thus far shown only that if every boy had held exactly his 
high school rank among his own schoolmates when he did his 
freshman college work, there would still have been more than half 
as many changes in rank as Pettit found, and all because of the 
different standards in the three schools. If now there shall be 



TABLE 15

Showing the Numbers Retaining Same Rank and the Numbers Changing
Rank from High School to Freshman College, in Each School
Group Considered Separately (Compiled from Pettit)





Retain 

Same 

Quintile 

Position 


Gain 1
Quintile


Gain 2 


Gain 3 


Gain 4 


Lose 1 


Lose 2 


Lose 3 


Lose 4 


School B 

1st Quintile . . 
2nd Quintile . . 
3rd Quintile. . . 
4th'Quintile. . . 
5th Quintile. . . 


14 
13 
11 
16 
10 


1 
2 

6 


1 
2 
1 






3 
2 
3 


2 

1 






Total 


64 


9 


4 






8 


3 






Per cent 


73 


' 10 


4 






9 


3 







School A 

1st Quintile . . . 
2nd Quintile . . 
3rd Quintile. .. 
4th Quintile. . . 
5th Quintile. .. 


7 
6 

7 
7 
5 


2 
1 
2 
1 


1 
4 


1 




2 
2 

1 


1 
2 


1 




Total 


32 


6 


5 


1 




5 


3 


1 






60 


11 


10 


2 




10 


5 


1 




School C 

1st Quintile . . . 
2nd Quintile . 
3rd Quintile. . . 
4th Quintile. . . 
5th Quintile. . . 


12 
10 
10 

9 


1 

2 
4 
2 


3 




1 


3 
2 
4 
3 


3 






Total 


50 


9 


3 




1 


12 


3 






Per cent 


64 


11 


4 




1 


15 


4 






Totals 3 schools 


146 


24 


12 


1 


1 


25 


9 


3 
found to be as great shifting of position within each school 
group as was found in the entire group, so that the same amount 
of shifting will be found with weighting marks to a common 
measure as without weighting them, as Dearborn claims, that will 
be mere accident. Certainly it will be a different fact when 
determined for each school separately than when determined for 
the group as a whole. 

To enable me to make the comparison suggested above, it 
was necessary to calculate the quintile changes for each school 
group separately. These data are given in Table 15. 

From this table, No. 15, it will be observed that the total 
quintile changes, when calculated as indicated for Table 13, 
are 107. It will be recalled that this is about the same 
number as was found to represent the quintile changes in 
the whole group of 218 taken together (that number being 112), 
but it represents a very different fact. In the first we had a 
measure of the change in rank due to two causes combined, 
namely, the sliding up or down of entire school groups, and the 
shifting of position within the entire group; in the latter figure 
we have a measure of the shifting of positions by members within 
their own school group. The fact that the two measures are so 
nearly the same is an indication that the difference in standards 
from school to school is about the same as the difference in 
standards among teachers of the same school, and leads to the 
suspicion that standards are a mixture of about equal parts of 
tradition, which influences a school group, and individual notions 
of teachers. 

Turning now to the data furnished by Dearborn, I shall stop 
only long enough to point out that the differences in standards 
in Wisconsin high schools are not less than those in New York. 
His three groups of schools are, (1) Madison, (2) Milwaukee, 
three high schools together, and (3) four smaller high schools. 
Constructing a table for them similar to Table 12, for the New 
York high schools showing how the makeup of the highest and 
lowest quartile groups change from high school to freshman 
college, we have the following table, No. 16 : 
TABLE 16

Showing the Number from Each School Group Which Make up the
Lowest and Highest Fourths of the Total High School and Freshman
College Groups, and the Per Cent this Number is of the Number
Which Each School Should Have to Make the Representation
Proportionate to the Whole Number from that School (Compiled
from Dearborn)

                      Lowest Quartile     Lowest Quartile     Highest Quartile    Highest Quartile
                      H. S. Group         College Group       H. S. Group         College Group
                      No.   % of Quota    No.   % of Quota    No.   % of Quota    No.   % of Quota
Madison               85      144         64      108         41       69         49       83
Milwaukee             25       72         35      100         45      129         48      137
Small High Schools     8       33         19       80         32      133         21       87


Without using any more exact calculation than Table 16 
affords, we may see what share of the quartile changes is due to 
differences among the groups as wholes. In the lowest quartile 
we note that 

Madison loses 21 

Milwaukee gains 10 

Small high schools gain 11 

while in the highest quartile 

Madison gains 8 

Milwaukee gains 3 

Small high schools lose 11 

A grand total of sixty-four quartile changes in first and fourth 
quartiles, due to shifting of whole school groups. 

From Dearborn's table (page 14), we note that there are retained

In the first quartile     76
In the fourth quartile    54

A total of 130 retained in these two quartiles.

The number in each quartile is 118, thus making 236 the 
number in both. If 130 retain their position, there are 106 as 
the total number of changes in the two quartiles. Of these 
there are sixty-four, or sixty per cent, which may be accounted 
for by differences in standards among the schools, as groups. 
It is scarcely to be expected that a calculation carried through 
the two middle quartiles would change this percentage greatly. 
The diagram, Fig. 2, indicates that the reversal of position is 
about equally pronounced all along the line. 
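The sixty per cent figure can be retraced from Table 16 and the two retention counts quoted from Dearborn. A minimal sketch; the names are illustrative and the short label "Small" stands for the four smaller high schools:

```python
# (number by high school rank, number by college rank) from Table 16
lowest_quartile = {"Madison": (85, 64), "Milwaukee": (25, 35), "Small": (8, 19)}
highest_quartile = {"Madison": (41, 49), "Milwaukee": (45, 48), "Small": (32, 21)}

# Quartile changes due to the shifting of whole school groups
group_shift = sum(abs(college - hs)
                  for hs, college in list(lowest_quartile.values())
                  + list(highest_quartile.values()))

# Changes of any kind in the first and fourth quartiles:
# 118 pupils in each quartile, of whom 76 and 54 retain their position
retained = 76 + 54
changes = 2 * 118 - retained

share = group_shift / changes
print(group_shift, changes, round(100 * share))
```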
Fig. 2. (Data from Dearborn.) Division points between quartiles in the 
distribution according to high school marks and again in the distribution 
according to college freshman marks of the same pupils. 

Summarizing, it seems to me that in a study supporting the 
plan of entrance to college by accreditment, a plan which re- 
gards each high school as the final judge of the fitness of its 
students for college, a factor which is great enough to account 
for from fifty-five per cent to sixty per cent of the changes in 
rank (which changes are the bases of the study), should not have 
been disregarded. Furthermore, the plan of using averages of 
a number of teachers' marks to indicate student strength hides 
the most serious defect of the plan of accreditment, namely, the 
wide differences in standards among the teachers of any subject 
in the various accredited high schools. 

Whether or not the conclusion reached by Dearborn is sound, 
we cannot overlook the fact that in his carefully tabulated data 
is a fund of information which must be appraised. There is 
significance in the determination of how many pupils retain their 
quartile position from high school to college even when average 
of four years' work makes the high school rank in every case. 
As the easiest way to bring together the facts which we wish to 
evaluate, I have averaged the quartile retentions for each of the 
groups compared in the study, and have listed the averages in 
the following table, No. 17. To make clear the derivation of 
these "average quartile retentions," let me take the case of the 
first one in the table, 45.2. On page 14 of Dearborn's study is 
the table indicating the quartile retention in university fresh- 
man class of the 472 students from eight high schools. Of the 
lowest quartile in high school (that is, the lowest 118 pupils) 
64.4 per cent are in the poorest quartile in freshman college work. 
Of the second quartile, in high school, 39.8 per cent are found in 
the second quartile in college freshman work. Of the third 
quartile, 31.4 per cent, and of the fourth quartile, 45.8 per cent, 
are in the corresponding quartile in college freshman work. 
The average of these four per cents is 45.2. The table will now 
be clear. 

TABLE 17 

Data Concerning Quartile Retention in College of the Representatives 
of Eight High Schools in Wisconsin (Tabulated from Dearborn) 

                     Groups Compared                       No. of    Avg. Quartile
High School Group            College Group                 Pupils    Retention

8 H. S.'s, Genl. Avg.        Freshman, Genl. Avg.            472        45.2
    "          "             Sophomore     "                 357        43.6
    "          "             Freshman      "    *            180        40.7
    "          "             Sophomore     "    *            180        47.2
    "          "             Junior        "    *            180        41.0
    "          "             Senior        "    *            180        41.5
Madison        "             Freshman      "                 238        43.7
    "          "             Sophomore     "                 188        45.7
Milwaukee      "             Freshman      "                 139        42.7
    "          "             Sophomore     "                  99        42.7
4 small H. S.'s "            Freshman      "                  92        35.7
    "          "             Sophomore     "                  82        40.5
8 H. S.'s English            Freshman English                255        36.5
    "     Mathematics            "    Mathematics            216        33.7
    "     German                 "    German                 189        42.2
    "     History                "    History                219        45.2
Madison, Genl. Avg.          Freshman, Genl. Avg. *          115        41.0
    "          "             Sophomore     "    *            115        47.5
    "          "             Junior        "    *            115        33.0
    "          "             Senior        "    *            115        44.5
Madison Mathematics          Freshman Mathematics            181        51.2
    "    English                 "    English                244        36.0
    "    German                  "    German                 126        42.5
    "    History                 "    History                 97        37.5
    "    Latin                   "    Latin                   39        46.0
    "    Mathematics         Mathematics                      69        34.5
Milwaukee English            Freshman English                 92        46.2
    "     History                "    History                 64        48.5
4 small H. S.'s English          "    English                 53        33.5
    "           History          "    History                 42        33.2

Average (counting all of equal weight)                                  41.43

* Those who completed the college course. 

Summarizing and Averaging Groups from the Above Table 

High School Group                  College Group             Retention

8 H. S.'s Genl. Avg.               Freshman Genl. Avg.          45.2
Madison, Milwaukee, and 4 small
  H. S.'s Genl. Avg.               Freshman Genl. Avg.          40.7 (average)

8 H. S.'s Genl. Avg.               Sophomore Genl. Avg.         43.6
Madison, Milwaukee, and 4 small
  H. S.'s Genl. Avg.               Sophomore Genl. Avg.         42.9 (average)

(For groups finishing college) 

8 H. S.'s Genl. Avg.               Fresh., Soph., Junior,
                                     and Senior Genl. Avg.      42.6
Madison Genl. Avg.                 Fresh., Soph., Junior,
                                     and Senior Genl. Avg.      41.4

(Individual subjects in H. S. and College) 

8 H. S.'s English                  Freshman English             36.5
Madison, Milwaukee, and 4 small
  H. S.'s English                  Freshman English             38.6 (average)

8 H. S.'s History                  Freshman History             45.2
Madison, Milwaukee, and 4 small
  H. S.'s History                  Freshman History             39.7 (average)



From the above table it will be observed that the average 
quartile retentions range from 51.2 to 33.0. As a central tendency 
for all these retentions the rough average was calculated by simply 
giving each figure for retention its face value. The average thus 
determined is 41.43. 

The question which these data present to us is this: How sat- 
isfactory is an average quartile retention of 41.43 per cent? 
To be sure, no definite answer can be given, but it is possible to 
consider the question and get a clearer idea of it than appears 
on the surface. 

It will be noted first that in a quartile arrangement an abso- 
lutely random redistribution would result in a quartile retention 
of 25 per cent. In our retention of 41.43 per cent we have 
evidence that 16.43 per cent more than a random redistribution 
would make, retain the same quartile position in college classes 
that they had in the average of high school. In other words, 
we have such a quartile retention that 60.3 per cent is accounted 
for by a random redistribution. I state it in this way so as to 
make it comparable with results secured in Clement's and Gray's 
studies to be considered later. 
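
The arithmetic behind the 60.3 per cent figure may be checked with a minimal sketch. The function name `chance_share` is a hypothetical label introduced here for illustration and is not a term used by Dearborn; it simply divides the retention expected from a random redistribution into equal groups by the retention actually observed.

```python
def chance_share(observed_retention, n_groups):
    """Fraction of the observed retention that a purely random
    redistribution into n_groups equal groups would already produce.

    chance_share is a hypothetical helper name, not taken from the studies.
    """
    chance_retention = 100.0 / n_groups  # 25 per cent for quartiles
    return chance_retention / observed_retention


# Average quartile retention from Table 17.
observed = 41.43
share = chance_share(observed, 4)
excess_over_chance = observed - 25.0

print(round(100 * share, 1))         # 60.3
print(round(excess_over_chance, 2))  # 16.43
```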

In the second place, a glance at the summary at the bottom of 
the table will indicate another feature of this retention. It will 
be observed that whenever the general averages of freshman 
marks in all subjects are considered, the retention is greater for 
the group of eight high schools taken together than for each high 
school group taken separately. In this connection I must re- 
peat what was said in criticism of using correlation of averages 
to support the accreditment plan of college admission, that the 
farther the groups being compared are removed from the ranks 
or marks given by individual teachers the closer is the correlation 
between them. If, however, the marks of individual teachers 
were a close approximation of student ability in the subject, 
then the nearer the groups would approach to individual teachers' 
marks in allied subjects, the closer would be the correlation. 
When we turn to the retention indicated for separate subjects, 
we find that while they are in practically every case lower than 
the general averages, those in which the eight high schools are 
grouped together are on the whole a little higher than those for 
the individual school groups. In this case we have an indication 
of different standards of marking in the different high schools 
composing each group, so great that it cannot be counter- 
balanced without the use of averages of several subjects. 

We are now ready to consider the evidence bearing upon 
standards of rating pupils in the high school which is given in 
the two most recent studies of the subject, the one by Clement, 
and the other by Gray. In both of these the plan of comparing 
marks of pupils with marks given the same pupils in later years 
is used, but there is a minimum of averaging, and little com- 
bination of several schools into one group. Thus the fallacies 
pointed out in Dearborn's work are avoided in these, and we have 
the task of evaluating the wealth of material which these two 
studies provide. 

Clement used the records of nearly 5,000 high school graduates, 
mostly in Kansas schools. Twenty-three high schools of rep- 
resentative sorts were included. The records of as many of 
these 5,000 high school pupils as possible were traced back into 
the grammar school and forward into college. Of course rela- 
tively few could thus be traced, but nevertheless the long list of 
comparisons which he was able to make affords the richest mine 
of information concerning marks that we have. 

His method was to compare each group with itself in some 
later class, indicating in the comparison the per cent who re- 
tained their original tertile position. To make this plain I 
have here reproduced one of the tables of comparison which he 
uses. From this it will be clear that of the thirty-seven pupils 
who were in the first tertile in seventh grade history, 22 were in 
the first tertile in eighth grade history, eleven in the second tertile, 
and four in the third tertile. The retention, then, is twenty-two 
out of thirty-seven, or 59.45 per cent. The total or average 
retention of the whole class is seen to be 51.78 per cent. It is 
this average tertile retention which represents the most significant 
fact for our purposes, and we shall, therefore, assemble into one 
series of tables for easy study the figures representing tertile 
retention from group to group which are scattered through 
Clement's study. 



                                 History, School 5, Eighth Grade

                                                          % Tertile
                      Tertile      1       2       3      Retention

History, School 5,       1        22      11       4        59.45
Seventh Grade            2        11      15      12        39.49
                         3         4      12      21        56.75

                                                            51.78















While tabulating the tertile retentions we shall also tabulate 
tertile division points, those points below which one third of the 
group fall, and above which another third of the group fall, so 
as to make easy the comparison of range of ratings between 
school and school, or group and group. 
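
The way a retention figure is read off such a cross-tabulation can be sketched as follows, using the counts of the School 5 history table reproduced above. The function name `tertile_retention` is an assumption made for illustration. The sketch computes exact quotients, so its results may differ in the second decimal from the printed per cents.

```python
# Rows: tertile in seventh grade history.
# Columns: tertile in eighth grade history.
crosstab = [
    [22, 11, 4],
    [11, 15, 12],
    [4, 12, 21],
]


def tertile_retention(table):
    """Return (per-tertile retention per cents, overall retention per cent).

    A pupil is 'retained' when he falls in the same tertile at both markings,
    that is, on the diagonal of the table.
    """
    per_tertile = [100.0 * row[i] / sum(row) for i, row in enumerate(table)]
    retained = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return per_tertile, 100.0 * retained / total


per_tertile, overall = tertile_retention(crosstab)
print([round(p, 2) for p in per_tertile])
print(round(overall, 2))
```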

Clement uses a second method of indicating retention of posi- 
tion, a method which he calls the "modified median method." 
For this index he calculates the percentage of the lowest third 
who remain below the median in the next grouping, and the 
percentage of the highest third who remain above the median 
in the next grouping, and then averages the two percentages. 
Wherever this method was used by Clement, I have copied his 
figures in the column headed "Median Retention." 
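
A minimal sketch of the "modified median method" as described above follows. The marks are invented for illustration and are not Clement's data; the function name and the treatment of ties (a mark exactly at the median counts as neither above nor below it) are likewise assumptions, since the text does not say how Clement handled such cases.

```python
from statistics import median


def modified_median_retention(first, second):
    """Average of two percentages: the lowest third at the first marking who
    fall below the median at the second, and the highest third at the first
    marking who fall above the median at the second.

    first, second: marks of the same pupils, listed in the same order.
    Ties at the median count as not retained (an assumption).
    """
    n = len(first)
    third = n // 3
    order = sorted(range(n), key=lambda i: first[i])
    lowest, highest = order[:third], order[-third:]
    mid = median(second)
    low_kept = sum(second[i] < mid for i in lowest) / third
    high_kept = sum(second[i] > mid for i in highest) / third
    return 100.0 * (low_kept + high_kept) / 2


# Hypothetical marks for nine pupils at two successive markings.
first_marks = [70, 72, 75, 80, 82, 85, 88, 90, 95]
second_marks = [74, 85, 71, 78, 80, 83, 79, 92, 94]
print(round(modified_median_retention(first_marks, second_marks), 2))
```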

These data are all given in the following tables, numbered 18, 
19, 20, and 21: 

TABLE 18 

Data for Marks Given Successive Classes in the Grammar Schools 
(Tabulated from Clement) 

                                          Division Points     Avg. Tertile   Median
School   Pupils   Classes Compared        between Tertiles    Retention      Retention

No. 5     112     7th history               88.1    94.1         51.76         79.76
                  8th history               89.1    94.1
No. 5     112     7th English               85.8    91.0         54.46         77.05
                  8th English               86.4    90.1
No. 5     112     7th arithmetic            87.4    92.5         45.53         67.56
                  8th arithmetic            90.5    94.9

                  Average                                        50.58         74.79

No. 5     112     7th English                                    51.78
                  7th history
No. 5     112     7th English                                    49.10
                  7th arithmetic
No. 5     112     8th English                                    53.57
                  8th history
No. 5     112     8th English                                    52.67
                  8th arithmetic

                  Average                                        51.78

No. 7      78     7th English               86.2    93.2         51.28         76.73
                  8th English               85.3    91.2
No. 7      78     7th arithmetic            83.9    92.6         56.41         74.98
                  8th arithmetic            86.6    92.9

                  Average                                        53.84         75.85

TABLE 19 

Data for Marks Given to Successive Classes in High Schools 
(Tabulated from Clement) 

                                          Division Points     Avg. Tertile   Median
School   Pupils   Classes Compared        between Tertiles    Retention      Retention

No. 8     126     Fresh. English            84.4    90.4         59.52         88.09
                  Soph. English             83.4    90.8
No. 8     126     Soph. English             83.4    90.8         73.01         90.47
                  Jun. English              80.3    90.4
No. 8     114     Fresh. Latin              85.2    92.1         65.78         86.83
                  Soph. Latin               80.5    90.6
No. 8     125     Fresh. math.              84.1    90.7         48.00         71.44
                  Soph. math.               80.6    90.0

                  Average                                        61.58         84.20

No. 9     160     Fresh. English            82.9    88.4         52.50         72.63
                  Soph. English             81.8    87.4
No. 9     160     Soph. English             81.8    87.4         60.00         75.45
                  Jun. English              80.4    85.5
No. 9      93     Fresh. Latin              83.6    89.9         54.83         70.96
                  Soph. Latin               79.4    84.6
No. 9     117     Fresh. math.              83.0    90.5         58.11         87.17
                  Soph. math.               78.9    85.5

                  Average                                        56.36         76.55

No. 5     212     Fresh. English            81.6    87.9         49.52         76.05
                  Soph. English             81.6    86.5
No. 5     212     Soph. English             81.6    86.5         55.66         80.84
                  Jun. English              80.4    86.5
No. 5     217     Fresh. Latin              85.0    90.4         52.99         81.24
                  Soph. Latin               78.5    86.3
No. 5     212     Fresh. math.              85.6    90.3         58.49         76.80
                  Soph. math.               81.6    89.7

                  Average                                        54.16         78.73

Nos. 8,   633     Fresh. English            82.9    88.7         54.50
9 and 5           Soph. English             82.0    88.2
Nos. 8,   633     Soph. English             82.0    88.2         52.76
9 and 5           Jun. English              80.5    86.5
Nos. 8,   467     Fresh. Latin              84.6    90.8         58.68
9 and 5           Soph. Latin               79.6    87.6
Nos. 8,   589     Fresh. math.              84.8    90.6         55.68
9 and 5           Soph. math.               81.4    89.4

                  Average                                        55.40

TABLE 20 

Data for Marks Given to Classes in Grammar School and High School 
(Tabulated from Clement) 

                                               Division Points     Avg. Tertile   Median
School   Pupils   Classes Compared             between Tertiles    Retention      Retention

No. 5     212     8th English                    86.4    90.5         46.17         66.18
                  Fresh. English                 81.6    87.9
No. 5     212     8th English                    86.4    90.5         44.33         64.78
                  Soph. English                  81.6    86.5
No. 5     212     8th English                    86.4    90.5         53.17         71.83
                  Jun. English                   80.4    85.5
No. 5     212     8th English                    86.4    90.5         43.39         67.60
                  Sen. English                   80.9    86.4
No. 5     212     8th arithmetic                 88.7    93.8         44.81         61.96
                  Fresh. math.                   85.6    90.4
No. 5     212     8th arithmetic                 88.7    93.8         41.50         68.30
                  Soph. math.                    81.5    89.7
No. 5     212     8th history                    89.2    93.8         48.11         73.93
                  Soph. history                  82.4    88.0
No. 5     181     8th English                    86.4    90.9         50.82         71.66
                  Fresh. Latin                   83.0    89.1
No. 8     126     8th English                    88.3    93.8         46.30         71.43
                  Avg. 3 yrs. H. S. English      82.5    90.1
No. 10    150     8th English                    82.9    89.7         48.00         73.00
                  Avg. 3 yrs. H. S. English      88.2    91.3
No. 7     270     8th English                    84.3    90.4         45.74         69.94
                  Soph. English                  86.0    92.6
No. 7     270     8th English                    84.3    90.4         43.70         67.21
                  Fresh. English                 85.2    92.6
No. 7     270     8th arithmetic                 83.8    91.4         48.51         69.99
                  Fresh. math.                   86.2    92.7
No. 6     338     8th arithmetic                 (see note)           46.15         69.02
                  Avg. Fresh. and Soph. math.
No. 6     302     8th English                    (see note)           53.31         74.75
                  Avg. Fresh. and Soph. Latin
No. 6     338     8th English                    (see note)           56.21         77.87
                  Avg. Fresh. and Soph. English

Note on School No. 6: Coarse grouping of marks makes these division 
points uncertain. 

TABLE 20 (Continued) 

Data for Marks Given to Classes in Grammar School and High School 
(Tabulated from Clement) 

(Schools 2, 3, and 4, in the same city, complete grammar school with 
7th grade.) 

                                               Division Points     Avg. Tertile   Median
School   Pupils   Classes Compared             between Tertiles    Retention      Retention

No. 2      97     7th arithmetic                                      53.60
                  Avg. Fresh. and Soph. math.
No. 2      72     7th English                                         46.05
                  Fresh. Latin
No. 2      97     7th English                                         50.51
                  Fresh. English
No. 2      97     7th English                                         56.70
                  Avg. Fresh. and Soph. English
No. 3      93     7th arithmetic                                      56.98
                  Avg. Fresh. and Soph. math.
No. 3      78     7th English                                         55.12
                  Fresh. Latin
No. 3      93     7th English                                         59.13
                  Fresh. English
No. 3      93     7th English                                         46.23
                  Avg. Fresh. and Soph. English
No. 4      73     7th English                                         43.97
                  Avg. Fresh. and Soph. Latin
No. 4      73     7th English                                         47.94
                  Fresh. English
No. 4      73     7th English                                         46.57
                  Avg. Fresh. and Soph. English
No. 4      73     7th arithmetic                                      43.14
                  Avg. Fresh. and Soph. math.
Nos. 2,   299     7th English                    80.5    88.8         42.47         66.00
3 and 4           Avg. Fresh. and Soph. English  79.0    86.0
Nos. 2,   299     7th arithmetic                 81.7    89.0         43.00         65.50
3 and 4           Avg. Fresh. and Soph. math.    75.1    83.0
Nos. 2,   166     7th English                    83.5    89.8         46.67         68.27
3 and 4           Avg. Fresh. and Soph. Latin    76.0    85.3

Average (giving equal weight to each 
group regardless of size)                                             48.36         69.43

TABLE 21 

Data for Marks Given to Classes in High School and College 
(Tabulated from Clement) 

                                                        Division Points     Avg. Tertile
School        Pupils   Classes Compared                 between Tertiles    Retention

H. S. No. 1    266     H. S. Fresh. English               91.2    95.0         50.00
Col. No. 1             Col. Fresh. English                84.0    92.0
H. S. No. 1    266     Avg. 3 yrs. H. S. English          89.0    94.0         59.02
Col. No. 1             Col. Fresh. English                84.0    92.0
H. S. No. 1     86     Avg. 3 yrs. H. S. English          91.5    96.0         60.46
Col. No. 1             Avg. 4 yrs. Col. English           84.0    92.0
H. S. No. 5     81     H. S. Fresh. English               82.6    87.9         35.80
Col. No. 2             Col. Fresh. English                80.4    88.5
H. S. No. 5     81     H. S. Soph. English                82.6    87.4         53.08
Col. No. 2             Col. Fresh. English                80.4    88.5
H. S. No. 5     81     Avg. 4 yrs. H. S. English          82.6    86.7         45.67
Col. No. 2             Col. Fresh. English                80.4    88.5
H. S. No. 5     60     Avg. Fresh. and Soph. H. S. Math.  85.5    90.5         40.00
Col. No. 2             Col. Fresh. Math.                  76.5    84.3
H. S. No. 7     84     H. S. Fresh. English               (see note)           53.57
Col. No. 1             Col. Fresh. English
H. S. No. 6    184     H. S. Fresh. math.                 (see note)           60.32
Col. No. 3             Col. Fresh. Math.
H. S. No. 6    165     H. S. Fresh. English               (see note)           43.00
Col. No. 3             Col. Fresh. English

Average (giving equal weight to all groups)                                    50.09

Note: These schools used only three marks, and the tertile retention 
was calculated on basis of these marks. 

Some Averages from the Above Tables to Indicate Influence of Difference 
of Standards among the Schools Making Up a Group, upon Tertile Retention 

First, from Table 19: 

The average of all retentions by single subjects from year to year in 
Schools 8, 9, and 5 taken separately                                    57.4 

The average of all retentions by single subjects from year to year 
in Schools 8, 9, and 5 taken together                                   55.0 

Second, from Table 20: 

The average of all retentions from seventh arithmetic to the average 
of freshman and sophomore mathematics in Schools 2, 3, and 4 
taken separately                                                        51.2 

The average of all retentions from seventh arithmetic to the average 
of freshman and sophomore mathematics in Schools 2, 3, and 4 
taken together                                                          43.0 

The average of all retentions from seventh English to the average 
of freshman and sophomore English in Schools 2, 3, and 4 taken 
separately                                                              49.8 

The average of all retentions from seventh English to the average 
of freshman and sophomore English in Schools 2, 3, and 4 taken 
together                                                                42.5 

In connection with the foregoing tables we must ask the same 
question as was asked concerning the data in Dearborn's study: 
How satisfactory is the retention here indicated? Between 
successive classes in grammar school there is a tertile retention of 
slightly more than 50 per cent. It is no greater on the average 
than the retention between different subjects taken during the 
same year, however, which indicates that a pupil who is good in 
history, say, this year is as likely to be found among the good pu- 
pils in arithmetic this year as he is to be found among the good 
pupils in history next year. 

Among successive classes in high school we find a little higher 
tertile retention. This may be accounted for in part perhaps 
by the fact that most of the work in high schools is done depart- 
mentally so that the class in freshman Latin, say, this year will 
be taught sophomore Latin next year by the same teacher. In 
that case the personal equation would weigh in the same direction 
in successive years, and work to increase tertile retention. At 
any rate, there seems to be a retention averaging about 57 per 
cent between the same subjects in successive years in high 
schools. 

Turning to the retention in high school of grammar school 
ranks, we find much lower figures. For all the schools con- 
sidered the average retention is a little below 50 per cent. It 
may be urged that this is probably caused by the abrupt change 
in both subject matter and method between grammar school 
and high school. While this no doubt has some effect it must be 
remembered that the retention between successive years in 
grammar school where the subject matter is very closely similar 
from year to year, was but little higher. The children change 
teachers from seventh grade to eighth grade, just as they do 
from eighth grade to freshman high school, and that is probably 
the greatest reason for the lower retention found between these 
two groups than occurs between successive years in high school 
where teachers do not as a rule change. 

Between high school and college the retention is, on the av- 
erage, about 50 per cent. It will be observed that the highest 
figure in the list is that for the average of all high school with 
the average of all college English, an evidence again of the oft- 
noted fact that averaging marks for several years or several 
subjects tends to cover up the most serious fault of our present 
marking system. 

It appears, then, that the tertile retention for all classes in all 
schools and between the various schools is a little above 50 per 
cent. Now how satisfactory is a tertile retention of 50 per cent? 
Bearing in mind that a perfectly random redistribution at each 
successive marking would produce a tertile retention of 33.3 
per cent, we have in this retention of 50 per cent such a retention 
that a random redistribution accounts for 66.7 per cent of it. 
If we use the figures given for "modified median retention," we 
note that on the whole they run a little less than 75 per cent. 
By this method of calculating retention, a random redistribution 
would produce 50 per cent retention. Here again, then, we 
have evidence that chance accounts for 66.7 per cent of the 
retention. It seems fair to make a comparison on this basis, 
with Dearborn's data for certain Wisconsin high schools. It 
was found that chance accounted for but 60.3 per cent of the 
retention there, although comparisons were made only between 
high school and college marks. 
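
Set side by side, the three chance figures quoted above are simple quotients: the retention expected from a random redistribution divided by the retention observed. The round values of 50 and 75 per cent are the approximate averages named in the text, not exact tabulated figures, so the quotients below are offered only as a check on the arithmetic.

```latex
\[
\text{tertile: } \frac{33\tfrac{1}{3}}{50} \approx 0.667, \qquad
\text{modified median: } \frac{50}{75} \approx 0.667, \qquad
\text{quartile (Dearborn): } \frac{25}{41.43} \approx 0.603 .
\]
```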

Before leaving Clement's study attention may be called to 
the list of tertile division points. It will be seen that not only 
are there some rather marked differences in the standards of 
marks used among the various departments of the same school 
as well as among like subjects in various schools, but there is a 
most consistent tendency to reduce the marks from one school 
to a higher school. That is to be expected, however, when we 
consider that the poorer members of the group drop out before 
they enter the next higher school, and of course a normal dis- 
tribution made for the members of each successive group would 
tend to distribute ever lower and lower those members of the 
original group who persisted to the end. I call attention to 
this fact here because it has a very definite bearing upon the 
question as to whether we should plan for a more and more 
skewed distribution as we advance in the grades where elimi- 
nation takes place. 

In Gray's study we have the most significant type of data yet 
gathered bearing upon the subject of marks. He considers the 
individual records of pupils from ten different high schools, 
mostly in Indiana. He does not tell 1 how many records enter 
into his conclusions, but it is fair to assume that he used enough 
to make his figures valid. Neither is it stated that only high 
school graduates were used, but that fact is implied throughout, 
and I shall act upon that assumption. 

The method used by Gray was that of calculating the number 
of points change which occurred from one mark of a pupil to the 
next in the same subject in the high school. For example, if a 
mark of 80 was received in freshman history and 85 in sopho- 
more history, a change of 5 points was recorded for that promo- 
tion. Similarly all changes were recorded and averages struck 
for each school in each department of study. 
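
Gray's measure for a single pupil in a single subject can be sketched as below. The function name is hypothetical, and the sketch assumes that the size of each change is counted without regard to its direction, which the text implies but does not state outright. The four-year series of marks is invented for illustration.

```python
def mean_point_change(marks_by_year):
    """Average change in points between each mark and the next for one
    pupil in one subject. Direction of change is ignored (an assumption)."""
    changes = [abs(later - earlier)
               for earlier, later in zip(marks_by_year, marks_by_year[1:])]
    return sum(changes) / len(changes)


# The example from the text: 80 in freshman history, 85 in sophomore history.
print(mean_point_change([80, 85]))          # 5.0

# A hypothetical four-year record: changes of 5, 3, and 6 points.
print(round(mean_point_change([80, 85, 82, 88]), 2))
```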

Incidental to this main purpose Gray pointed out many 
irregularities and variations in standards of grading as well as 
forms of distributions of marks which were to be found in the 
several schools and departments. These furnish most impres- 
sive evidence of the need of some method of standardization in 
high school work, but I shall not undertake detailed comment 
upon them. I shall rather confine my attention to his main 



1 Upon direct inquiry from Mr. Gray I learn that the following numbers of 
pupils from each school entered into his records and that all had completed 
high school except 75 in School 2: 

School 1    140        School 6     25 
School 2    100        School 7     25 
School 3    135        School 8     30 
School 4     35        School 9     25 
School 5     25        School 10    30 

tables in which he summarizes the variations in marks above 
referred to. From his tables we assemble the data in Table 22. 

TABLE 22 

Average Variation in Points at Each Promotion for Each Pupil in 
Each Subject (Tabulated from Gray) 

                No. of     All        Schools     Other
Subject         Schools    Schools    1 and 3     Schools

English           10        4.0        3.95        4.04
History            9        3.8        3.85        3.77
Mathematics       10        4.3        4.65        4.22
Latin              9        3.5        3.55        3.53
Mod. Lang.         9        3.2        3.70        3.02
Science           10        4.7¹       4.30        4.87

Averages                    3.92       4.00        3.91

To discover whether the larger numbers of pupils studied in 
Schools 1 and 3 than in the other schools makes the results from 
them differ widely from the other schools, I separated them from 
the rest and compiled column 2 of the table. 

In trying to answer the question, "How satisfactory is a 
variation of 3.92 points?" we find it difficult to get a satisfactory 
basis of comparison with the results given by Clement and 
Dearborn. In the effort to make such comparison we have 
worked upon an assumption which is not capable of absolute 
proof. For those who may wish to discount the assumption, 
however, this statement will form a basis of comparison which 
will be more helpful than no basis. 

In the following discussion I shall try to answer the question, 
"What per cent is 3.92 points variation, of the variation which 
would occur by a perfectly random redistribution at each in- 
stance of remarking by the teachers?" To do this I have to 
make an assumption of the typical range of marks which these 
classes would fall into, and also the form of distribution into 
which they would fall. Several elements enter to guide this 
assumption. First, if the list is confined to those who continue 
four years in high school, the distributions of the early high 
school classes will be bunched pretty high on the scale, and 
therefore not make a wide distribution. In the second place, 
the large number of graphs given by Gray represent in almost 
every instance no cases below seventy-five, and I therefore 

1 Gray's text errs in making this figure 3.7. 

assume that the schools had seventy-five as a passing mark. 
In the third place, an examination of the graphs published by 
Gray reveals that from 80 per cent to 90 per cent of the cases 
there represented do actually fall, in the majority of instances, 
within a range of fifteen marks. On these grounds I have 
assumed that the groups from which Gray's cases were drawn 
distribute themselves midway between the two forms recom- 
mended as most appropriate by J. McK. Cattell, 1 and Max 
Meyer, 2 in which each of the five steps of the scale is considered 
as spread over five points as follows: 

             73 to 77   78 to 82   83 to 87   88 to 92   93 to 97

Cattell        10%        20%        40%        20%        10%
Meyer           3%        22%        50%        22%         3%

For the sake of simplicity, consider a class of 100 pupils dis- 
tributed according to the Cattell distribution. Upon a random 
redistribution at promotion time, or rather the time of remarking, 
they would be found in the following arrangement: 

             73 to 77   78 to 82   83 to 87   88 to 92   93 to 97

Lowest 10        1          2          4          2          1
Next 20          2          4          8          4          2
Middle 40        4          8         16          8          4
Next 20          2          4          8          4          2
Highest 10       1          2          4          2          1

To obtain these positions, the following number of points 
changes or variations would be sustained: 

             73 to 77   78 to 82   83 to 87   88 to 92   93 to 97   Total

Lowest 10        0         10         40         30         20
Next 20         10          0         40         40         30
Middle 40       40         40          0         40         40
Next 20         30         40         40          0         10
Highest 10      20         30         40         10          0

Totals         100        120        160        120        100       600

It is evident, then, that these 100 pupils if distributed accord- 
ing to the Cattell scheme would make a total of 600 points 
changes with a random redistribution, or an average of six points 
per pupil. 

1 J. McK. Cattell, Examinations, Grades and Credits, Pop. Sci. Monthly, 
2 Max Meyer, The Grading of Students, Science (n. s.), 28:243-252. 

Consider now the Meyer distribution in the same way. A 
random redistribution of the 100 pupils would produce the 
following arrangement of the members of each group: 

             73 to 77   78 to 82   83 to 87   88 to 92   93 to 97

Lowest 3        .09        .66        1.5        .66        .09
Next 22         .66       4.84       11.0       4.84        .66
Middle 50      1.50      11.00       25.0      11.00       1.50
Next 22         .66       4.84       11.0       4.84        .66
Highest 3       .09        .66        1.5        .66        .09

To accomplish the above arrangement, the following amount 
of changes in points would be involved: 

             73 to 77   78 to 82   83 to 87   88 to 92   93 to 97   Total

Lowest 3         0         3.3       15.0        9.9        1.8
Next 22         3.3         0        55.0       48.4        9.9
Middle 50      15.0       55.0         0        55.0       15.0
Next 22         9.9       48.4       55.0         0         3.3
Highest 3       1.8        9.9       15.0        3.3         0

Totals         30.0      116.6      140.0      116.6       30.0     433.2

From this table it appears that a random redistribution of 100 
pupils arranged after the Meyer plan produces 433 points changes, 
or an average of 4.33 points per pupil. 
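
Both totals can be checked with a short sketch. It assumes, as the tables do, that each of the five steps counts as five points and that a pupil's new step is drawn independently of his old one, in the same proportions as the class as a whole. The function name `expected_change` is hypothetical.

```python
def expected_change(shares, step_width=5):
    """Average points change per pupil under a random redistribution.

    shares: fraction of the class in each step of the scale, lowest first.
    A pupil in step i lands in step j with probability shares[j], and the
    move costs |i - j| * step_width points.
    """
    total = 0.0
    for i, share_from in enumerate(shares):
        for j, share_to in enumerate(shares):
            total += share_from * share_to * abs(i - j) * step_width
    return total


cattell = [0.10, 0.20, 0.40, 0.20, 0.10]
meyer = [0.03, 0.22, 0.50, 0.22, 0.03]

print(round(expected_change(cattell), 2))  # 6.0
print(round(expected_change(meyer), 2))    # 4.33
```

Multiplying either result by 100 pupils reproduces the totals of 600 and 433.2 points given in the two tables.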

It seems fair to take the average of the two figures obtained 
from these two calculations, 5.17 (that is, 6 plus 4.33, divided 
by 2) and consider it the number of points change which would 
accompany a chance arrangement of grades at each remarking 
of Gray's people. Now we have a basis of comparing the 
retention of position in these schools taken pupil by pupil, 
subject by subject, with the retention found by Clement, who 
combined several classes to make his groups, and with Dearborn, 
who used the averages of several years' marks. The average 
number of points change actually found is 3.92 per pupil. A 
chance redistribution would make 5.17 points change per pupil. 
We have, then, but 25 per cent improvement over a chance 
redistribution. 
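
The steps of this comparison may be written out as follows. Worked exactly, the improvement over chance comes to about 24 per cent, which the text gives in round terms as 25.

```latex
\[
\frac{6.00 + 4.33}{2} \approx 5.17, \qquad
\frac{3.92}{5.17} \approx 0.758, \qquad
1 - 0.758 = 0.242 .
\]
```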

In the case of Dearborn's data, we were able to make the 
statement that the retention was such that chance accounted 
for 60.3 per cent of it. With Clement's data, chance accounted 
for 66.7 per cent of the retention. While we cannot make a 
similar statement for Gray's data we can get a fairly clear idea 
of how it compares in respect to retention by saying that the 
points changes per pupil are 75.8 per cent as great as they would 
be by chance. I recognize that it is dangerous to press this 
comparison far. It cannot be demonstrated that it is at all a 
sound basis of comparison, but it is the only way in which we may 
conceive a relationship among them. Unfortunately the data are 
not so arranged that a coefficient of correlation can be calculated 
for each resemblance. By means of this basis of comparison we 
can at least see that the close correlation which we should expect 
between marks in successive years in the same subject does not 
exist, but that on the contrary, the more estimates we average 
to get a pupil's rank, the closer the correlation. In other words, 
just as Clement found in his few classes, teachers' marks in any 
subject are an index of general ability quite as much as they are 
an index of special ability in the given subject. This surely 
points to a sort of dead level of student interest in high school. 
But turning now to the actual retention found, are we ready 
to accept the standard which Clement says we may judge our 
schools by, namely, a tertile retention of approximately 50 per 
cent? If we can come no nearer than that in ranking our chil- 
dren for general ability, we cannot hope to command much respect 
as a teaching profession. Rather should the revelations made 
by these studies open our eyes to the real need for some more 
effectual method for establishing standards whereby both teach- 
ers and pupils may measure progress. No more striking illus- 
tration of the far-reaching effect of having no definite standards 
could be found than just what these studies reveal: teachers do 
not draw out special abilities from their high school pupils. No 
more fruitful source of discouragement and of elimination exists 
to-day than just that failure to find and develop the special 
interests of the pupils. 



STANDARDS OF MARKING IN COLLEGES 

The non-uniformity of standard of marking among the in- 
structors in colleges was first brought forcibly to public atten- 
tion by Professor Max Meyer 1 in the University of Missouri. 
He collected all the marks for a period of five years of forty 
instructors, mostly in the College of Arts and Sciences, all but 
two of whom had the rank of professor or assistant professor. 
The marks were all in terms of the uniform system, A, B, C, D, 
and E. D meant failure with the privilege of another exami- 
nation, and E meant failure without such privilege. Meyer 
combined the D's and E's, using the letter F for the combined 
group. He then tabulated for each instructor the number of 
classes he had taught during the five years, the total number of 
marks he had given, and the per cent which he had used of each 
letter, A, B, C, and F. In addition to these facts, he calculated 
also the coefficient of variability in the giving of each letter from 
class to class. This coefficient is derived by dividing the average 
variation from the average percentage which any professor 
assigns a given mark, by the average percentage assigned that 
mark. For example, the philosophy professor listed in the 
table gave 55 per cent of his people A, on an average, but he 
varied from class to class by an average variation of 11 per cent. 
Therefore, the coefficient of variability is 11/55, or .2. These 
data for the half dozen instructors at either extreme of the list 
are reproduced in Table 23. 
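
The arithmetic of this coefficient may be sketched as follows. This is only an illustration, assuming that the "average variation" is the mean absolute deviation of the class percentages from their mean. The function name and the four class percentages are hypothetical, chosen so that they reproduce the philosophy professor's figures of 55 and 11; they are not Meyer's raw data.

```python
from statistics import mean


def coefficient_of_variability(class_percentages):
    """Average variation from the average, divided by the average."""
    average = mean(class_percentages)
    average_variation = mean(abs(p - average) for p in class_percentages)
    return average_variation / average


# Hypothetical class-by-class percentages of A's (not Meyer's raw data),
# chosen to average 55 with an average variation of 11.
percent_a_by_class = [44, 66, 44, 66]

print(round(coefficient_of_variability(percent_a_by_class), 2))  # 0.2
```

A coefficient near 1, such as several of those in the F column of Table 23, means that the percentage given that letter swings from class to class by about as much as its own average size.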

TABLE 23 

Showing the Variability of Marking by Instructors in the University 
of Missouri (From Meyer) 

                                           Total   No. of   Coefficients of Variability 
Instructor in       %A    %B    %C    %F   Marks  Classes      A     B     C     F 

Philosophy          55    33    10     2     623       29     .2    .3    .8   1.2 
Latin I             52    42     6     -     130        9     .3    .3   1.2     - 
Sociology           52    30    13     5     958       47     .3    .5    .9    .9 
Mathematics I       40    31    16    13     208       19     .6    .6    .8    .9 
Economics           39    37    19     5     461       28     .4    .4    .7    .9 
Greek               39    26    24    11     287       30     .4    .4    .5    .9 

Average             46    33    15     6 

Engineering I       13    36    42     9     813       39     .6    .3    .2   1.0 
Mech. Drawing       18    29    41    12     558       28     .4    .4    .3    .9 
Mechanics           18    26    42    14     495       12    1.1    .3    .3    .4 
Engineering II      16    26    46    12     826        ?     .3    .3    .3    .9 
English II           9    28    35    28    1098       44     .8    .3    .3    .4 
Chemistry III        1    11    60    28    1903       12    1.0    .6    .1    .3 

Average           12.5    26    44    17 

The need for the reform in marking, which was effected in the 
University of Missouri shortly after Meyer's investigation, is 
evident from the above table. It is not to be supposed, how- 
ever, that Missouri was exceptional in this absence of uniformity. 
There have been enough similar investigations in other insti- 
tutions to prove that just such variation is the rule among col- 
lege instructors. 

We are not much surprised by facts brought out by Meyer. 
In fact there still persists a very general feeling that college 
instructors should be allowed practically absolute freedom to 
conduct their classes in any way they see fit, and so we rather 
expect to see individual standards manifested in the marks 
given. It should be kept in mind, however, that the adoption 
of some method whereby a given mark may signify more nearly 
the same merit in the several departments, is not a restraint 
upon that cherished independence. 

1 Max Meyer, The Grading of Students, Science, 28:243-252. 

Since all the studies made in this field point to the same 
variation, it seems unnecessary to do more than indicate the 
institutions where such investigations have been made. This 
will suffice to establish the claim that standardization is as 
much needed in college as in high school or elementary school. 

In the appendix to Dearborn's " School and University Grades " 
is a series of tables setting forth in great detail the distributions 
of marks given at the University of Wisconsin. William T. 
Foster 1 worked out with similar care the marks given at Har- 
vard. His graphical representations tell a very plain story of 
the situation there. While we scarcely need a proof of the con- 
tention that low standing in a course is not prophetic of failure 
in one's career, yet the table indicating the undergraduate marks 
received by men of honor standing in the professional schools 
shows pretty plainly where the relationship does hold. Foster 
gives also the significant facts concerning the variation among 
instructors' marks at the University of California. 

1 William T. Foster, Scientific vs. Personal Distribution of College Credits, 
Pop. Sc. Mo., 78:378-408. 

In the 1910-11 report of the President of the University of 
Chicago, pages 91 to 94, we have a table indicating the wide 
variations among marks given by the different instructors in 
that institution. 

Edwin E. Slosson 1 after examining the situation at Amherst 
College records his conviction that the marking there tells more 
about the instructors than about the students. 

In 1905 Cattell 2 tabulated a few of the markings of Columbia 
University instructors as a basis for his recommendation concern- 
ing a proper type of distribution. A far more exhaustive study 
of the marking system of Columbia was made in 1906, however, 
by Miss Mary T. Whitley. 3 From her report, which appeared 
in the form of a Master's essay, it is clear that Columbia stood 
at that time high in the list of institutions giving to instructors 
a maximum of individual liberty. 

By all these studies the significance of President Foster's 
question is emphasized: "Can the personal equation as the 
chief factor in the awarding of college marks be supplanted by 
scientific guidance?" A partial answer to this question is what 
is attempted in a later section of this discussion. First, however, 
we must evaluate our present common means of standardiza- 
tion, the examination paper. 



1 E. E. Slosson, A Study of Amherst Grades, Independent, April 20, 1911. 

2 J. McK. Cattell, Examinations, Grades and Credits, Popular Science 
Monthly, 66: 367-378. 

3 Mary T. Whitley, Statistical Study of College Marks, Master's essay, 
Teachers College, 1906. 



THE MARKING OF EXAMINATION PAPERS 

The use of examination papers as a means of measuring knowl- 
edge, or efficiency, or mental ability, or whatever name may be 
given to that which is supposed to indicate one's fitness for a 
particular grade of work is almost a universal custom in our 
schools. It is being extended more and more each year to civil 
service and industrial positions. In spite of this the few studies 
which have been made reveal a very wide difference of rating 
upon the same paper among supposedly competent judges. We 
shall not in this section attempt to analyze the situation to 
determine the causes of variation among judges. We shall 
merely indicate how reliable examinations in actual practice 
are, in order to have some basis for our expectations concerning 
the use of standard tests or scales for evaluating papers. 

F. Y. Edgeworth, professor of Political Economy at the Uni- 
versity of Oxford, was among the first to call wide attention to 
this variation. The care with which his first experiment was 
conducted justifies a full statement of it here. In 1889 he 
inserted a specimen of Latin prose composition in the English 
Journal of Education accompanied by a request that competent 
persons rate the paper. Quoting from his article: "I propose, 
through the medium of the Journal of Education, to invite any 
competent person to assign a mark to the subjoined piece of 
Latin prose, upon the supposition that he is marking the work 
of a candidate for the India Civil Service. Let it be distinctly 
understood that in giving his mark the examiner is not to look 
to, or wish to illustrate, his own ideal of classical elegance nor 
yet the degree of proficiency which may be current in the school 
or other institution with which he may be connected. Let him 
imagine that he has been appointed examiner in Latin for the 
India Civil Service, and let him give his mark, having regard 
only to what may be expected from a candidate for that prize. 
Let 100 be the maximum attainable by any candidate. 

"To avoid accidental divergence as much as possible, to per- 
form the experiment under the most favorable conditions, I 
would suggest that the examiners should consist of a pretty 
homogeneous class — of much the same class as those who actually 
conduct our public examinations. To be more definite I would 
invite to take part in this experiment only those who have 
taken high honors in classics at one of the universities, or classi- 
cal masters of the sixth form in a public school. All such are 
earnestly invited to examine the accompanying piece with as 
much care as if they really were exercising the function of public 
examiner; and send to the editor their verdict, guaranteed by 
their name and status, which, it need hardly be added, it is not 
intended to publish. It is desirable that the examiners should 
assign their respective marks independently, and without mutual 
conference." 

In response to this appeal, "twenty-eight highly competent 
examiners were so kind as to mark this piece of Latin prose." 

The twenty-eight marks distributed themselves as follows: 
45, 59, 67, 67.5, 70, 70, 72.5, 75, 75, 75, 75, 75, 75, 77, 80, 80, 
80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90, 100, 100. 

While two examiners thought the paper met the requirements 
perfectly, four others marked it less than 70. 
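
The spread of these twenty-eight marks can be checked with a few lines of Python. The sketch below simply re-enters the marks as quoted above; the variable names are my own, and nothing beyond the quoted figures is assumed.

```python
from statistics import median

# The twenty-eight marks quoted above from Edgeworth.
marks = [45, 59, 67, 67.5, 70, 70, 72.5, 75, 75, 75, 75, 75, 75, 77,
         80, 80, 80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90, 100, 100]

middle_mark = median(marks)                    # middle of the ordered marks
spread = max(marks) - min(marks)               # highest minus lowest mark
below_70 = sum(1 for m in marks if m < 70)     # examiners marking under 70
perfect = sum(1 for m in marks if m == 100)    # examiners giving full marks

print(len(marks), middle_mark, spread, below_70, perfect)  # 28 78.5 55 4 2
```

The same paper thus drew marks 55 points apart, with the middle mark at 78.5.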

Upon discovering so much divergence among these "highly 
competent examiners," Edgeworth entered into a very careful 
study of examinations, giving especial attention to the Civil 
Service papers. A full account of his work appears in the 1890 
report of the Royal Statistical Society. It seems unnecessary 
to quote from his tables since his reputation as a statistician and 
economist insures us against any overstatement in his conclusion. 
His most significant conclusion he states thus: "I find the ele- 
ment of chance in these public examinations (India Civil Service, 
Army, and Home Civil Service clerkships of the second order) 
to be such that only a fraction — from a third to two thirds — of 
the successful candidates can be regarded as quite safe, above 
the danger of coming out unsuccessful if a different set of equally 
competent judges had happened to be appointed." 

We surely need no other justification for studying further the 
soundness of our examination system. 

In 1911 Allen Mead Ruggles conducted an experiment in 
marking papers, the results of which are reported in a Master's 
essay submitted at Columbia University. He had twenty 
sixth-grade geography papers rated by eleven graduate students 
in Teachers College. To indicate the range of marks for each 
paper, the following table, No. 24, is quoted: 

TABLE 24 

Showing the Marks by Each of Eleven Judges, Designated by Letters, 
upon Each of Twenty Geography Papers 

(From Ruggles) 

                                                                                          A.D. from 
Papers             X     E     S     B     W     H     A     P     K     C     O  Median     Median 

1                 40    70    53    37    77    53    63    65    37    57    65      54       11.0 
2                 21    40    15    10    29    32    30    23    30    60    25      29        8.9 
3                 33    30    55    13    53    29    40    50    47    35    55      40       10.9 
4                 27    50    60    28    59    60    48    40    90    72    15      50       18.5 
5                 24    15    10    14    40    26    26    25    30    34    60      27       10.7 
6                 63    50    85    45    58    56    40    60    25    35    55      55       12.3 
7                 29    30    45    21    40    47    30    25    20    20    65      37       12.0 
8                 59    75    85    38    72    74    55    55    75    45    60      59       11.8 
9                 27    25    10    53    20    48    35    60    25    30    70      34       14.8 
10                36    35    15    25    17    65    31    53    25    48    65      39       14.1 
11                39    35    75    40    49    57    35    52   100    44    40      47       13.4 
12                58    45    65    47    56    43    42    50    50    50    60      49        5.9 
13                28    22    ..    25    30    58    50    50    25    59    40      39       15.1 
14                49    50    55    53    44    77    59    45    40    69    50      53        7.6 
15                45    40    78    41    46    74    47    55    25    67    90      49       15.4 
16                57    12    60    20    22    35    46    35    60    26    20      27       15.1 
17                53    50    90    54    93    63    46    60   100    39   100      59       18.6 
18                67    55    90    50    65    65    58    80    65    48    50      60       10.1 
19                43    25    70    40    38    54    44    40    15    43    65      45       11.4 
20                53    35    90    47    56    60    60    53    58    51    45      54        8.6 

Medians         41.5  37.5  60.0  39.0  47.5  56.5  45.0  51.0  38.5  46.5  57.5    48.0      12.15 
A.D. from M.    12.1  13.1  22.7  11.1  17.1  11.1   8.9  10.5  21.1  11.4  14.8 

Median of the A. D.'s on the bottom row 12.1 

Note: The A. D.'s are my own calculations. 

Rather surprising variations are revealed in this table. Paper 
18 has the highest average, 60, and the other papers range down 
to 27, the mark assigned to papers 5 and 16. However, judge 
S considers the entire set of papers worth 60 on the average 
while judge E considers them worth only 37.5. In fact, the 
median of the average deviations from the median among the 
marks assigned the same paper by the several judges is just as 
large as the median of the deviations among the marks of each 
judge upon the several papers. In other words, there is as much 
variation among the several judges as to the value of each paper 
as there is variation among the several papers in the estimation 
of each judge. And the set of papers are of widely different 
values too. 
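
The bottom rows of Table 24 can be reproduced for any one judge. The sketch below does so for judges X and E, assuming that "A. D. from M." means the mean absolute deviation of a judge's twenty marks from their median; the function name is my own. Under that assumption the medians come out at 41.5 and 37.5, as tabulated, and the average deviations at 12.15 and 13.05, which agree with the tabulated 12.1 and 13.1 to within rounding.

```python
from statistics import mean, median


def average_deviation_from_median(marks):
    """Mean absolute deviation of the marks from their median."""
    mid = median(marks)
    return mean(abs(m - mid) for m in marks)


# Marks of judges X and E on the twenty papers, read down Table 24.
judge_x = [40, 21, 33, 27, 24, 63, 29, 59, 27, 36,
           39, 58, 28, 49, 45, 57, 53, 67, 43, 53]
judge_e = [70, 40, 30, 50, 15, 50, 30, 75, 25, 35,
           35, 45, 22, 50, 40, 12, 50, 55, 25, 35]

for name, marks in (("X", judge_x), ("E", judge_e)):
    print(name, median(marks), round(average_deviation_from_median(marks), 2))
```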

Another brief experiment was performed at Columbia by H. 
Jacoby. 1 He asked six professors of astronomy to mark a set 
of eleven astronomy papers. The rating was to be done on the 
scale of 10, with 7 as the passing mark. One judge misunder- 
stood directions, and so the marks of only five are significant. 
These marks are reproduced in Table 25. 

1 H. Jacoby, The Marking System in the Astronomical Course at Columbia 
College, 1909-1910, Science, 31: 819. 

TABLE 25 
The Marks of Five Judges on Eleven Astronomy Papers (Jacoby) 

Paper                    1     2     3     4     5     6     7     8     9    10    11 

Judge A                  9     7     9    10     7    10     6     9     8    10     9 
Judge B                  9   6.6     9   9.4   6.2   9.8   5.8   9.3   5.7   8.5     9 
Judge C                8.5     7   8.8   9.9   6.7   9.6   6.3   9.7     9   9.1   9.5 
Judge E                  9     6     8    10     7    10     7     9    10     9     8 
Judge F                7.3   6.5     8   9.2   5.9   9.5   5.4   8.8   8.7     9     9 

Maximum Divergence     1.7   1.5     1    .8   1.1    .5   1.6    .9   4.3   1.5   1.5   Avg. 1.5 



Thus there appears an average difference of 1.5 on a scale of 
10 between the highest and lowest mark on a paper. In the 
cases of four papers out of eleven some judges would pass the 
student while other judges would fail the student. 
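
Both statements can be checked directly from Table 25. The sketch below re-enters the five judges' marks and computes, for each paper, the difference between the highest and lowest mark, and then lists the papers on which the judges fall on both sides of the passing mark of 7. The names are my own. As transcribed here, paper 2 works out to a divergence of 1.0 where the table prints 1.5; every other entry agrees.

```python
PASSING_MARK = 7

# Marks from Table 25: one list per judge, papers 1 to 11 in order.
marks_by_judge = {
    "A": [9, 7, 9, 10, 7, 10, 6, 9, 8, 10, 9],
    "B": [9, 6.6, 9, 9.4, 6.2, 9.8, 5.8, 9.3, 5.7, 8.5, 9],
    "C": [8.5, 7, 8.8, 9.9, 6.7, 9.6, 6.3, 9.7, 9, 9.1, 9.5],
    "E": [9, 6, 8, 10, 7, 10, 7, 9, 10, 9, 8],
    "F": [7.3, 6.5, 8, 9.2, 5.9, 9.5, 5.4, 8.8, 8.7, 9, 9],
}

# Regroup so that each entry holds the five marks given to one paper.
marks_by_paper = list(zip(*marks_by_judge.values()))

# Highest mark minus lowest mark on each paper.
divergence = [round(max(p) - min(p), 1) for p in marks_by_paper]

# Papers passed by at least one judge and failed by at least one other.
split = [i + 1 for i, p in enumerate(marks_by_paper)
         if min(p) < PASSING_MARK <= max(p)]

print(divergence)
print(split)  # [2, 5, 7, 9]
```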

Perhaps the most striking demonstrations of the divergence of 
marks of teachers upon individual papers have been afforded 
by the recent experiments by Starch and Elliott 1 at the Univer- 
sity of Wisconsin. They used two English papers which were 
written by two pupils at the end of the first year of high school 
English, and a geometry paper handed in as an exercise in one 
of the largest high schools in the state. A facsimile reproduction 
was made of each paper and printed upon exactly the same 
sort of paper as that on which the pupil had written it. 
These sheets, along with a set of the questions to which they had 
been given as answers, were sent out to the high schools con- 
stituting the North Central Association of Colleges and Secondary 
Schools with a request that the papers be rated by the ones in 
each high school best qualified to pass upon them, presumably 
the heads of the departments of English and mathematics, 
respectively. The replies were treated in two separate groups, 
those from high schools having 75 as a passing mark, and those 
having 70. The English papers were then rated by a class in 
the Teaching of English, in the University of Wisconsin, and 
by a Summer School class of teachers, in the University of 
Chicago. The distributions for all these groups of judges are 
given in Table 25a. 

1 Starch and Elliott, Reliability of Grading High School Work in English, 
School Review, Sept., 1912. Also, Reliability of Grading Work in Mathe- 
matics, School Review, April, 1913. 

TABLE 25a 

Distribution of Marks upon Three Papers: "A" and "B" Are English 
and "C" Is Geometry. The Marks Are All Combined in Case of 
"C" by Weighting the Marks from Schools Having 70 as Passing 
Mark, by Adding 3 to Each Mark (Data from Starch and Elliott) 

[Column headings: paper A, four groups of judges; paper B, the same 
four groups; paper C, all schools combined. Empty cells are not shown.] 

28          1 
39          1 
41          1 
44          1 
48          2 
50 to 54    6 
55 to 59    1 1 8 
60 to 64    2 1 1 2 1 17 
65 to 69    1 1 6 6 2 10 19 
70 to 74    2 1 1 5 5 11 7 14 13 
75 to 79    5 6 1 4 24 9 7 20 27 
80 to 84    18 7 6 24 27 13 27 21 11 
85 to 89    24 17 16 31 19 5 24 23 7 
90 to 94    30 15 40 25 7 3 18 9 2 
95 to 100   9 4 22 7 1 1 1 

Total       91 51 86 97 91 51 86 98 116 
Medians     88.3 87.2 92.4 86.7 80.4 78.8 84.5 80.5 70.0 
Med. Dev    4.5 4.2 3.0 4.3 4.4 5.8 4.2 5.8 7.5 

Note: The median deviations are my own calculations. 

These tables may be allowed to speak for themselves. We 
need to point out only two features: There is a difference of 
more than five between the median mark given by the high 
school teachers and the class in the Teaching of English in the 
case of either English paper; and the chances are about even 
that, in the case of any group of judges, the paper will be changed 
five points or more when given from one teacher to the next for 
rating. I wish to call attention to these two facts because of 
their similarity with those revealed in the study of the New 
York Regents Examinations to be reported a little later. 

I shall report upon but one other experiment with marking. 
That experiment is described at the close of Gray's study to 
which attention was directed earlier. Gray secured sets of 
examination answer papers in mathematics and in English from 
an Indiana high school, and asked five other competent persons 
besides the class teacher to rate them. These five ratings along 
with the rating of the class teachers who had furnished the 
papers are given in Gray's book. I shall give in Table 26 only 
the average mark for each paper and the average of the varia- 
tions from the average, and the average of the marks which 
each judge gave to the entire set. The judges were experienced 
teachers, and the passing mark was understood to be 70 in each 
case. 

TABLE 26 

The Average and Average Deviation Among Six Judges of a Set of 
Mathematics and a Set of English Papers, and the Average of 
all Marks Given by Each Judge (Gray) 

      Mathematics                     English 

Papers    Avg.    A.D.        Papers    Avg.    A.D. 

   1      66.5     8.0           1      59.3     9.6 
   2      77.3     4.2           2      82.5     9.3 
   3      86.3     3.1           3      82.7     7.5 
   4      28.5     9.2           4      77.8    10.8 
   5      45.5    14.7           5      75.6     6.7 
   6      57.0     3.0           6      66.8    11.8 
   7      76.6     7.2           7      71.2    10.8 
   8      83.5     7.3           8      79.3     7.6 
   9      67.5    10.2           9      79.3     7.6 
  10      78.8     4.2          10      77.1     5.3 
  11      44.5     7.0          11      87.0    12.0 
                                12      85.3     6.0 
                                13      68.3    11.2 
                                14      76.5     8.2 
                                15      76.1    10.5 
                                16      74.0     8.3 
                                17      65.3     9.5 
                                18      62.8    16.8 
                                19      77.8     8.5 
                                20      79.8     5.4 

Average            7.1        Average            9.2 

Average of Each Judge's Several Marks 

            Mathematics    English 

Judge A        78.7          80.3     (original teacher) 
Judge B        74.0          83.7 
Judge C        61.4          78.5 
Judge D        65.5          54.0 
Judge E        75.5          70.0 
Judge F        58.0          79.5 

Note: The A. D.'s are my own calculations. 

In this table we have a greater variation among marks than 
was found by Starch and Elliott. In the marking of the mathe- 
matics papers the judges varied about as much as with the 
geometry paper in the above study, but in marking the English 
papers the variation was nearly twice as great as Starch and 
Elliott found. There is a difference of 20.7 points on the average 
between the marks of judges A and F of the mathematics set, 
and a difference of 29.7 points between the averages of judges B 
and D of the English set. In fact, judge D failed all but one of 
the papers, while judge B passed all but one, in the English set. 
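
The two differences quoted above follow at once from the judges' averages in Table 26. A minimal sketch, with names of my own choosing, finds the most lenient and the most severe judge of each set and the gap between their averages:

```python
# Average mark given by each judge to the whole set (Table 26).
math_averages = {"A": 78.7, "B": 74.0, "C": 61.4,
                 "D": 65.5, "E": 75.5, "F": 58.0}
english_averages = {"A": 80.3, "B": 83.7, "C": 78.5,
                    "D": 54.0, "E": 70.0, "F": 79.5}


def widest_gap(averages):
    """Return (most lenient judge, most severe judge, difference)."""
    lenient = max(averages, key=averages.get)
    severe = min(averages, key=averages.get)
    return lenient, severe, round(averages[lenient] - averages[severe], 1)


print(widest_gap(math_averages))     # ('A', 'F', 20.7)
print(widest_gap(english_averages))  # ('B', 'D', 29.7)
```

With a passing mark of 70, the average of the severest judge lies well below the line in each set and that of the most lenient judge well above it.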

In all of the above studies we see a very serious lack of standards 
among teachers. It is true that in all these cases the judges 
were selected from an area where no especial effort had been 
made to standardize the judgments. On this account I under- 
took to measure the variation between the marks of the teachers 
in New York state on the one hand and the Regents on the other. 
Here is a place where through several decades examinations have 
been given regularly throughout the state, and where there has 
been not only the opportunity but the necessity of standardizing 
the judgments of the many teachers as far as the present type of 
examinations accomplishes such standardization. It is in such 
a situation that the greatest care is exercised by the teachers 
because they recognize that they are themselves judged some- 
what by the correlation between their own and the regents' 
marking. 

Before giving the results of this study I wish to indicate some- 
thing of the extent of this system of examinations and some of 
its tendencies. For the series of years from 1889 to 1895 in- 
clusive, Thomas O. Baker 1 has tabulated the data found in 
Table 27, page 58, set opposite those years, and the reports of 
the Department of Education of the State of New York fur- 
nished the data for the years 1911, 1912, and 1913. 

1 Thomas O. Baker, An Analysis of the Regents' Examinations in Relation 
to Secondary Schools, Doctor's essay, New York University, New York City, 
1896. 

TABLE 27 

Data Concerning New York State Regents' Examinations 

Columns: (1) papers written; (2) papers passed by the teachers; 
(3) papers passed by the regents; (4) papers rejected by the teachers; 
(5) papers rejected by the regents; (6) per cent of papers passed by 
the regents; (7) per cent rejected by the teachers; (8) per cent 
rejected by the regents of those passed by the teachers. 

Year              (1)       (2)       (3)       (4)      (5)    (6)    (7)    (8) 

1889    304   193,197   107,149    99,079    86,048    8,070     51   44.5    7.5 
1890    311   201,488   117,257   107,915    84,231    9,342     53   41.3    7.9 
1891    358   244,979   153,788   146,565    91,191    7,223     59   37.2    4.7 
1892    357   278,907   176,516   155,869   102,391   20,647     56   36.9   11.7 
1893    393   302,471   185,677   165,676   116,794   20,001     55   38.3   10.8 
1894    410   357,908   238,319   211,533   119,589   26,786     59   33.4   11.2 
1895    468   388,945   259,932   231,231   129,013   28,701     59   33.0   11.0 

1911          452,703   363,708   309,608    88,995   54,100   68.3   19.6   14.9 
1912          327,043   273,624   233,768    53,419   39,856   71.5   16.3   14.6 
1913          392,252   319,582   279,035    72,670   40,547   71.1   18.5   12.7 

From this table the extent of the system is apparent. The 
two tendencies to which attention is called are the constantly 
increasing per cent of papers which the regents have passed, 
and, at the same time, the constantly increasing per cent of 
papers rejected by the regents of those passed by the teachers. 
There is, of course, a corresponding decrease in per cent of those 
rejected by the teachers in the schools. These two tendencies 
seem to me significant. While an ever-increasing number of 
pupils in the high schools of the state are able to meet the re- 
quirements of the examiners, the difference in standards of 
judging papers by teachers and examiners grows ever greater. 
While the requirements for high school teachers are constantly 
being increased, their judgment of the value of examination 
papers is being more and more rejected. At the same time 
that this tendency is present, the custom of accepting without 
reexamination the ratings of certain well known teachers is 
growing among the regents' examiners. This latter custom is 
used to such an extent, in fact, that in the report of 1913 above, 
if only the papers were counted which the regents reexamined, 
only 60.1 instead of 71.1 per cent would be found to be passed 
by the regents. In short, we seem driven by the facts here 
revealed to the conclusion that as the work in the high school 
becomes richer, the examination paper becomes a less satis- 
factory means of determining promotion, and we feel more and 
more the need of objective standards which are capable of con- 
sistent interpretation by all good teachers, as a means of measur- 
ing progress. 
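
The three per cents on which this argument rests can be recomputed from the counts of papers in Table 27. The sketch below does this for the years 1911 to 1913; the function and variable names are my own. It assumes that papers rejected by the teachers are those written less those the teachers passed, and that papers rejected by the regents are those the teachers passed less those the regents passed. The results agree with the tabulated per cents to within a tenth.

```python
def regents_percentages(written, passed_by_teachers, passed_by_regents):
    """Return the three per cents discussed in the text, to one decimal."""
    rejected_by_teachers = written - passed_by_teachers
    rejected_by_regents = passed_by_teachers - passed_by_regents
    return (
        round(100 * passed_by_regents / written, 1),               # passed by regents
        round(100 * rejected_by_teachers / written, 1),            # rejected by teachers
        round(100 * rejected_by_regents / passed_by_teachers, 1),  # rejected by regents
    )


# Counts for the last three years of Table 27:
# (papers written, passed by teachers, passed by regents).
counts = {
    1911: (452703, 363708, 309608),
    1912: (327043, 273624, 233768),
    1913: (392252, 319582, 279035),
}

for year, row in counts.items():
    print(year, regents_percentages(*row))
```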

This situation seemed to call for still further investigation. 
I therefore computed Table 28, from data contained in the 1913 
report, State Department of Education, pages 826 to 834: 

TABLE 28 

The New York State Regents' Examinations in High School Subjects, 
January, 1912, and June, 1912 

Columns: (1) papers written; (2) per cent passed by the teachers; 
(3) per cent rejected by the teachers; (4) per cent passed by the 
regents; (5) per cent rejected by the regents of those passed by the 
teachers. Regents' ratings, in per cents of those which the regents 
reexamined: (6) below 60; (7) 60 to 74; (8) 75 to 89; (9) 90 to 100. 

Subject                   (1)    (2)    (3)    (4)    (5)      (6)    (7)    (8)    (9) 

English                71,902   86.8   13.2   80.5    7.3     22.4   45.5   28.0    4.1 
German                 22,459   77.7   22.3   62.9   19.0     38.2   39.7   19.8    2.3 
French                  9,689   80.6   19.4   67.7   16.0     31.0   46.8   20.6    1.6 
Latin                  32,522   78.7   21.3   63.9   20.1     35.3   42.7   19.7    2.3 
Mathematics            79,786   74.2   25.8   65.1   25.7     37.2   29.9   21.6   11.3 
Science                61,989   85.8   14.2   76.9    9.2     27.5   36.2   30.5    5.8 
Hist. and Soc. Sci.    46,344   83.0   17.0   71.8   13.5     28.2   47.2   21.3    3.3 
Commercial             33,517   77.7   22.3   61.5   20.9     44.9   29.6   21.7    3.8 
Drawing                29,848   86.5   13.5   75.1   13.2     24.9   39.4   32.2    3.5 
Music                   2,485   81.7   18.3   80.1    1.9     19.9   22.8   35.6   21.7 
Other Subjects          1,812   89.0   11.0   74.6   16.2 

Total                 392,252   81.5   18.5   71.1   12.8     39.6   34.7   21.5    4.2 

Fig. 3. Representing the per cents of papers in the various departments 
of study, marked as follows by the regents in 1912: Lowest section, failed; 
next section, 60 to 74; next section, 75 to 89; top section, 90 to 100. The 
apparent discrepancy between the total and the several subjects is due to 
the fact that certain subjects were more liberally excused from reexamination 
by the regents than others. 

Examination of Table 28 reveals, first of all, a wide variation 
in the percentages of papers passed in the various subjects, the 
failures in commercial subjects, for example, being practically 
double the percentage of those in English. The distribution of 
the various ratings, four groups being designated, (1) below 60, 
(2) 60 to 74, (3) 75 to 89, and (4) 90 to 100, shows no special sim- 
ilarity among the various subjects. These data are represented 
graphically in Figure 3, and there we see at a glance how un- 
equal these examinations must be considered as tests of student 
ability. If the contention held so generally by students of educa- 
tion to-day has any validity, namely, that ability as represented 
by marks in school should be distributed in any large normal 
group of pupils approximately according to the probability sur- 
face of frequency, then these examinations as marked at present 
either by teachers or regents cannot be held to test at all ade- 
quately the abilities of the pupils in the several subjects of study. 

Table 28 discloses another fact of at least equal importance. 
Consider for a moment the two columns, "per cent rejected by 
the teachers" and "per cent rejected by the regents of those 
passed by the teachers." We find the columns running thus: 

English         13.2    7.3        History          17.0   13.5 
German          22.3   19.0        Commerce         22.3   20.9 
French          19.4   16.0        Drawing          13.5   13.2 
Latin           21.3   20.1        Music            18.3    1.9 
Mathematics     25.8   25.7        Other Subjects   11.0   16.2 
Science         14.2    9.2 

This phenomenon seems hard to understand. At first thought, 
one would suppose that the greater the per cent failed by the 
teachers, the fewer additional papers would be failed by the 
regents. We find, on the contrary, that with remarkable con- 
sistency, the greater the per cent failed by the teachers, the 
greater the additional per cent failed by the regents. The rule 
is not even violated in the case of mathematics, which by all tra- 
dition offers the greatest possibility of exactness in marking 
papers. If great care in speech is not demanded, we may say 
that in nearly all the subjects, the regents' examiners reject the 
judgment of the teachers to just about the same degree that the 
teachers reject the judgment of the pupils. 
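
The degree to which the two columns move together can be expressed as a coefficient of correlation. The sketch below is my own illustration and not a calculation made in the original study: it applies the ordinary product-moment formula to the eleven pairs of per cents listed above, and again to the nine larger subjects alone, leaving out Music and Other Subjects. On these figures the coefficient comes out a little over .6 for all eleven subjects and above .9 for the nine.

```python
from math import sqrt

# Per cent rejected by the teachers, and per cent rejected by the regents
# of those the teachers passed, in the order the subjects are listed above.
subjects = ["English", "German", "French", "Latin", "Mathematics", "Science",
            "History", "Commerce", "Drawing", "Music", "Other Subjects"]
by_teachers = [13.2, 22.3, 19.4, 21.3, 25.8, 14.2, 17.0, 22.3, 13.5, 18.3, 11.0]
by_regents = [7.3, 19.0, 16.0, 20.1, 25.7, 9.2, 13.5, 20.9, 13.2, 1.9, 16.2]


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation between two lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r_all = pearson_r(by_teachers, by_regents)
# Same calculation without the two smallest groups, Music and Other Subjects.
r_main = pearson_r(by_teachers[:9], by_regents[:9])

print(round(r_all, 2), round(r_main, 2))
```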

In the graphical representation of these data given in Fig- 
ure 4 we see how closely the two areas correspond not only in 
extent, but in shape as well. The only explanation which occurs 
to me for this is the absence of all harmonious standards among 
the examiners of the various subjects. When the questions are 
prepared by the examiners, a certain standard of excellence in 
high school work is set up in each subject. The questions are 
an attempt to measure ability by these several standards. When 
the thousands of teachers over the state get the questions with 
the answer papers from their respective classes, if the questions 
seem easy as measured by the small number of their pupils whom 
they feel compelled to fail, it is an evidence that the standard 
held by the examiner in that subject is not high as compared with 
the standard of the teachers of the subject throughout the state. 
If, on the other hand, the questions seem very hard to the teach- 
ers, and they must fail a larger per cent of the pupils, it is evi- 
dence that the standard of the examiner in that subject is higher 
than that held by the teachers. Consequently, when the teach- 
ers mark the papers by their own standards, the examiner whose 
standard is lower than theirs finds fewer papers to reject among 
those passed by the teachers than does the examiner whose stand- 
ard is higher than that of the teachers. 

Fig. 4. Showing the percentages, by departments, of all papers written 
in the regents examinations in January and June 1912, which were failed by 
the teachers, and also the percentages failed by the regents of those passed 
by the teachers. 

Whether or not this explanation is the true one, it seems cer- 
tain that if the examinations were as valuable an instrument for 
standardizing the work of the schools of New York as its advo- 
cates claim, this situation would not exist. 

As an interesting sidelight upon the variation among these 
standards as they exist in the various high schools of the state, 
Table 29 is presented. This was prepared from the 1913 report 
of the Department of Education from the table beginning on 
page 830. Not all the schools were used to make this distribu- 
tion, but the first 393 were taken with no omissions, and they 
form thus a sufficiently large random selection. 

TABLE 29 

Distribution by Schools, of the Percentages of Papers Passed by the 
Regents of Those Marked Passed by the Teachers. All Academic 
Examinations in 1912. First 393 Schools in the Alphabetical List 

Per cent of Papers Passed 
by Regents of Those Passed     Number of     Per cent of 
by Teachers                     Schools        Schools 

40 to 49.9                          2             .51 
50 to 59.9                         10            2.55 
60 to 61.9                          7            1.79 
62 to 63.9                          6            1.53 
64 to 65.9                          8            2.04 
66 to 67.9                          9            2.29 
68 to 69.9                          8            2.04 
70 to 71.9                         20            5.09 
72 to 73.9                         14            3.56 
74 to 75.9                         12            3.06 
76 to 77.9                         24            6.11 
78 to 79.9                         11            2.80 
80 to 81.9                         25            6.36 
82 to 83.9                         32            8.14 
84 to 85.9                         24            8.65 
86 to 87.9                         35            8.91 
88 to 89.9                         33            8.40 
90 to 91.9                         40           10.20 
92 to 93.9                         33            8.40 
94 to 95.9                         20            5.09 
96 to 97.9                          9            2.29 
98 to 100                          11            2.80 

Average        82.74 
Middle 50%     76.3 to 91.0 



This table should be read as follows: .51 per cent of the schools of New York 
state have fewer than 50 per cent of the papers passed by the regents which 
were passed by the teachers; 2.55 per cent of the schools have 50 to 59.9 per 
cent passed by the regents; one fourth of the schools have less than 76.3 
per cent of their papers accepted which they had passed, while another one- 
fourth of the schools have more than 91 per cent accepted. 
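
The per cent column of Table 29 can be recomputed from the column of counts. The following is only a minimal sketch of such a check in Python; the names TABLE_29, recompute_percentages, and rows_out_of_tolerance are illustrative and are not part of the study, and the tolerance of .05 is an assumption chosen to allow for rounding.

```python
# Rows of Table 29 as printed: (range of per cent passed, number of schools, printed per cent).
TABLE_29 = [
    ("40 to 49.9", 2, 0.51), ("50 to 59.9", 10, 2.55), ("60 to 61.9", 7, 1.79),
    ("62 to 63.9", 6, 1.53), ("64 to 65.9", 8, 2.04), ("66 to 67.9", 9, 2.29),
    ("68 to 69.9", 8, 2.04), ("70 to 71.9", 20, 5.09), ("72 to 73.9", 14, 3.56),
    ("74 to 75.9", 12, 3.06), ("76 to 77.9", 24, 6.11), ("78 to 79.9", 11, 2.80),
    ("80 to 81.9", 25, 6.36), ("82 to 83.9", 32, 8.14), ("84 to 85.9", 24, 8.65),
    ("86 to 87.9", 35, 8.91), ("88 to 89.9", 33, 8.40), ("90 to 91.9", 40, 10.20),
    ("92 to 93.9", 33, 8.40), ("94 to 95.9", 20, 5.09), ("96 to 97.9", 9, 2.29),
    ("98 to 100", 11, 2.80),
]


def recompute_percentages(rows):
    """Return the total number of schools and (label, printed %, recomputed %) per row."""
    total = sum(count for _, count, _ in rows)
    checked = [
        (label, printed, round(100 * count / total, 2))
        for label, count, printed in rows
    ]
    return total, checked


def rows_out_of_tolerance(rows, tolerance=0.05):
    """List the rows whose printed per cent differs from the recomputed one."""
    _, checked = recompute_percentages(rows)
    return [
        label
        for label, printed, recomputed in checked
        if abs(printed - recomputed) > tolerance
    ]


total_schools, checked_rows = recompute_percentages(TABLE_29)
flagged_rows = rows_out_of_tolerance(TABLE_29)
print(total_schools)
print(flagged_rows)
```

With the figures as they stand in the table, the counts sum to 393, and the only row flagged is 84 to 85.9, where the count and the per cent do not agree with each other.
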
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The significance of this table is not great. The schools are 
so different in size that any distribution of schools as units must 
be interpreted guardedly. However, in conjunction with the 
table giving percentages of papers failed, it is a certain indica- 
tion that there is as little agreement among the teachers of the 
state concerning standards hoped for by the regents themselves 
in the examinations, as there is among the examiners of the vari- 
ous subjects. 
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Fig. 5. Surface of frequency of schools having the various per cents of their 
claimed papers passed by the regents. (Data, Table 29.) 



The data of Table 29 are represented graphically in Figure 
5. If the spread here indicated persists at this time after so 
many decades of service of the examination system, it cannot 
be maintained that such a system is a very effective means of 
standardizing work either among the schools or the teachers. 

Turning to the study of the marks themselves which are on 
file in the examinations division of the Department of Education 
at Albany, I must first express my appreciation of the courtesy 
extended to me while examining the records. Not only was free 
access to all the records given, but also every facility for most 
readily transcribing the data was afforded. 
The study was confined to four questions:

(1) What is the distribution of differences between the marks 
given by teachers and examiners on the same papers? 

(2) What is the distribution of marks of teachers on papers 
marked a certain figure, say 75, by the examiners? 

(3) What is the distribution of marks of teachers on papers 
failed by the examiners? and the related question, What are the 
examiners' marks on papers marked near the failing point by the 
teachers? 

(4) What differences if any exist between the standards main- 
tained by the small high schools and the large ones? 

A detailed description of the system of examinations carried 
out in New York State seems unnecessary. Questions are sup- 
plied to all the high schools every half year on practically each 
year of work in each subject taught in the high school. The 
papers are first graded by the teachers in the schools, and then 
those marked 60 or above are sent to Albany to be reexamined 
by the examiners. This is, indeed, a task for a small army of 
readers. With the development of the department certain cus- 
toms have become quite fixed. Significant among these the fol- 
lowing may be mentioned as bearing most directly upon the 
findings of this study: 

(1) Before the readers start the rating of any set of papers, 
all those who are to help with any given set go over several papers 
together so as to gain as great uniformity as possible before be- 
ginning to mark. 

(2) Any paper marked failed by the reader which was passed 
by the teacher, must be read by another reader before it is finally 
failed.

(3) Where the difference between the examiner's mark and 
the teacher's mark is 3 or less, the examiner gives the same mark 
to the paper that the teacher had given, except in cases where the 
examiner's mark is below 60. In those cases, the examiner holds 
to his own mark, thus failing the paper. 

(4) The ratings of certain teachers, and afterwards certain 
schools, come to be accepted, and the papers rarely if ever re- 
examined. 

These traditions must be kept in mind in connection with all 
phases of the study. 

The ratings of the June, 1913, examinations were chosen since 
they were the most recent as well as the most accessible. Among 
the subjects the following were selected as perhaps the most rep- 
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resentative: English Grammar, Latin II, Elementary Algebra, 
American History and Civics, Physics, and Elementary Repre- 
sentation. The bases for the selection of schools were as follows: 

(1) The schools were taken in alphabetical order beginning at 
the first. 

(2) All "Union Schools" were thrown out. 

(3) All large schools (those the record for whose ratings re- 
quired more than one book) were thrown out. 

(4) All schools having fewer than three English Grammar 
papers were thrown out. 

(5) All schools whose ratings in English Grammar were ac- 
cepted without reexamination by the regents were thrown out. 

When the list of schools meeting these requirements totaled 
36, no more were added, but without exception, all the data in 
these thirty-six were used. 

For the five large high schools to use for the brief comparative 
study, the five double books, which came first to hand, were 
taken. I have not been asked to withhold the names of these 
schools, but it seems only courteous to do so. 

The following distribution, Table 30, furnishes an answer to 
the first question above: What is the distribution of differences 
between marks given by the teachers and the examiners to the 
same papers? The facts are represented graphically in Figures 
6 and 7. 
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TABLE 30 

The Distribution of Differences Between Teachers' Marks and
Regents' Marks on the Same Papers (36 Schools)
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Before commenting upon this table, another one whose data 
bear upon the meaning of these distributions must be given.
This is the table of distributions of the teachers' marks on papers
which the regents marked failed. This information is given in
Table 31, and is represented graphically in Figure 8.

Fig. 6. Showing the frequency of differences between the marks given
by the teachers and by the regents on the same papers. To the left of
zero are those on which the regents' marks are higher, and to the right
are those on which the regents' marks are lower. (Data in right hand
column of Table 30.)

Fig. 7. Same as Fig. 6, with the differences on the two sides of the
zero point added together.

TABLE 31 

Distribution of Teachers' Marks on Papers Which the Regents, upon 
Reexamination, Marked Failed 
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This table reads as follows: Of the thirty-six papers marked failed by the regents in English Grammar, the 
teachers had marked fifteen or 41.7 per cent of them at 60, etc. 




Fig. 8. The surface of frequency of teachers' marks on papers failed by
the regents. (Data, Table 31.)
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It is necessary to discuss these two tables, Nos. 30 and 31, 
together. The marks on papers which the regents failed enter 
into the composition of both tables. However, since no marks 
were put upon them by the regents except the failure mark, we 
cannot tell how much below 60 they would have been reduced 
had we known the mark. As it was, we arbitrarily called all 
papers which were marked failed, 59. Thus, when a teacher 
had marked a paper 60, the difference between the teacher's 
mark and the regents' mark is called one if the regents fail the 
paper. In the same way, the difference is called eight on a paper 
marked 67 by the teacher and failed by the regents. This will 
tend to a reduction from the real differences, and the tables of 
differences understate the facts somewhat. Since 392 of the 
total 2,463 papers used were failed by the regents, this influence 
is considerable. Since the median reduction of marks on all 
failed papers is 6 points even when all failed papers are rated at 
59, it is fair to assume that at least half of those differences which 
are in the 1, or 2, or 3, column due to failure, represent really 
differences as great as 5. It will be observed, also, that of the 
130 papers reduced 1 point by the regents, 90 of them were due 
to failures; of the 76 reduced by 2, 34 were due to failures; of 
the 60 reduced by 3, 18 were due to failures. 
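
The convention just described, of rating every failed paper at 59, can be put in a few lines. This is a minimal sketch only; the function names and the use of None to stand for a failed paper are assumptions made for illustration, while the counts are those quoted in the paragraph above.

```python
FAILED_PLACEHOLDER = 59  # the arbitrary mark assigned in this study to any failed paper


def recorded_difference(teacher_mark, regents_mark=None):
    """Teacher's mark minus regents' mark.

    regents_mark=None stands for a paper the regents failed without a figure;
    it is then counted as 59, so the result can only understate the true reduction.
    """
    effective = FAILED_PLACEHOLDER if regents_mark is None else regents_mark
    return teacher_mark - effective


# Counts quoted in the text: reduction in points -> (all papers, papers so placed by failure).
SMALL_REDUCTIONS = {1: (130, 90), 2: (76, 34), 3: (60, 18)}


def failure_share(reduction):
    """Fraction of the papers at a given small reduction that are there only through failure."""
    papers, failures = SMALL_REDUCTIONS[reduction]
    return failures / papers


print(recorded_difference(60))  # a paper marked 60 and failed
print(recorded_difference(67))  # a paper marked 67 and failed
for points in SMALL_REDUCTIONS:
    print(points, round(100 * failure_share(points), 1))
```

The two calls reproduce the differences of one and eight named above, and the loop gives the share of each small-reduction column that is due to failure alone.
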

With these facts in mind we may now briefly examine the 
tables separately. It should be said in the first place that the 
theoretical distribution made to allow the proper spread of the ex- 
tremes and the six middle steps, three on either side of the zero 
point, was not constructed with mathematical precision. There 
was not enough of the distribution given to allow of precise 
determination of its form. Without this, the undistributed por- 
tions could be spread by approximations only. It is, however, 
sufficiently accurate for all practical purposes, and if it errs at 
all, the error is in favor of making the differences smaller than 
they really are. 

Considering the distribution first which takes account of the 
differences on both sides of the zero point, we find the median 
difference is a reduction by the regents of 1.3 points, with the 
middle 50 per cent of the cases lying between an increase of .8 
points and a decrease of 7 points. It thus appears that there is 
one chance in four that a paper marked, say, 70 by the teacher, 
will be marked 71 or higher, another chance in four that it will 
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be marked 62 or lower, and one chance in two that it will be 
marked between 62 and 71, with an even chance that it will be 
changed at least 3 points. When we consider this in connection 
with the rather narrow range within which the marks lie (less 
than 25 per cent being above 74 on all the papers marked in 1912),
it seems a serious situation. The median mark given by the 
regents in 1912 to the 319,582 papers examined was probably 
65, although the method of reporting the figures does not allow 
of absolute determination of the median. Thus the median 
paper of the group has less than three chances in four of being 
passed by the regents. Furthermore, it will be noted by Table 
31 that 25 per cent of the papers which were failed had been 
marked 70 or more by the teachers. 

Marked variation among the subjects exists in several particu- 
lars. Notice first the percentages of papers on which the dif- 
ference between the teacher's mark and the regents' mark is 
three or less; then notice the percentages of papers reduced 
by 10 or more; finally notice the percentages of failures in each 
subject. These data are summarized in the following table, 
No. 32. 

TABLE 32

Significant Variations Among the Various Subjects

                         Percentages of Papers      Percentages of       Percentages of
                         Having a Difference of     Papers Reduced 10    Papers Failed by the
                         3 Points or Less between   Points or More by    Regents after being
                         Teacher's Mark and         the Regents          Passed by the
                         Regents' Mark                                   Teachers

English Grammar                  47.5                    11.1                  9.58
Latin II                         50.9                    14.8                 26.35
Algebra                          68.4                     5.0                  3.65
Am. Hist. and Civics             47.5                     8.0                 17.80
Physics                          33.9                    18.7                  8.37
Elem. Representation             30.4                    35.4                 36.70

Totals of all subjects           49.82                   13.80                16.30

In Table 32, compare, for example, the figures given for two 
standard subjects such as algebra and physics. Algebra has 
more than twice as many papers where differences cluster around 
zero, and less than a third as many where differences of ten or 
more exist, and less than half as many failures. Judged by these 
figures, the grading of the teachers of algebra is more than twice 
as reliable in the eyes of the regents as that of the teachers of physics. In
elementary representation the situation is even worse than in 
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physics. In fact, the teachers must have had little notion of 
the standard to be applied to the papers when more than a third 
of them were reduced by 10 or more points, and 9.6 per cent 
of them reduced by 20 or more points. 
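
The comparison of algebra with physics drawn from Table 32 amounts to three ratios. A minimal sketch follows; the dictionary keys and the helper name ratio are illustrative assumptions, and the figures are those printed in the table.

```python
# Figures for two subjects as printed in Table 32 (per cents).
TABLE_32 = {
    "Algebra": {"within_3": 68.4, "reduced_10_plus": 5.0, "failed": 3.65},
    "Physics": {"within_3": 33.9, "reduced_10_plus": 18.7, "failed": 8.37},
}


def ratio(measure, first="Algebra", second="Physics"):
    """Ratio of one subject's percentage to another's on a given measure."""
    return TABLE_32[first][measure] / TABLE_32[second][measure]


for measure in ("within_3", "reduced_10_plus", "failed"):
    print(measure, round(ratio(measure), 2))
```

The first ratio exceeds two, the second falls below one third, and the third falls below one half, which is the sense of the statement made in the text.
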

The percentages of failures in these subjects are seen to vary 
much more than the percentages recorded in Table 28 for math- 
ematics, English, Latin, science, etc. The variations there 
seemed very large, but if the variations are due to varying stand- 
ards among the judges, it is only natural that in the individual 
subjects we should find this increased as the last table reveals. 
These extreme differences in standards of individuals are largely 
concealed in the group of subjects taken together to make, say, 
English. The uncertainty which exists in the mind of the 
teacher as to the outcome of the visit of her papers to Albany, 
however, is determined by the hazard of the individual subject, 
and she has no way of knowing the outcome. She may lose 
them all. Indeed, several schools were encountered in this study 
which had every paper in certain subjects rejected by the regents. 

In this connection I may say that in collecting the data for 
this entire study, the distributions were first made for each school 
separately. Many most interesting things were revealed thereby 
but the tables become so very long it seems scarcely wise to pub- 
lish them. One illustration of striking nature may be noted. A 
certain school had sent in seventeen English grammar papers. 
On these papers the regents raised two marks by 7 points, raised 
one by 5, left twelve unchanged, lowered one by 5, and lowered 
the other by 6. None were failed. This made all round the 
best record of any school in English grammar so far as grading 
was concerned. When we came to the same school in Latin II 
we found sixty-one papers. On these papers no marks were 
raised, one was left unchanged, two reduced by 1, one reduced 
by 2, one by 3, three by 4, five by 5, three by 6, three by 7, five 
by 8, one by 9, twenty-six by from 10 to 14, and the other ten 
by from 15 to 19 points. Twenty-eight papers were failed. This 
made all round the worst record of any school in Latin II. 
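
The two records of this one school can be tallied to confirm that every paper is accounted for. The sketch below is illustrative only; the layout of the tallies and the function names are assumptions, and the counts are those given in the paragraph above.

```python
# English grammar: change made by the regents (in points) -> number of papers.
ENGLISH_GRAMMAR = {+7: 2, +5: 1, 0: 12, -5: 1, -6: 1}

# Latin II: ((lowest reduction, highest reduction), number of papers).
LATIN_II = [
    ((0, 0), 1), ((1, 1), 2), ((2, 2), 1), ((3, 3), 1), ((4, 4), 3),
    ((5, 5), 5), ((6, 6), 3), ((7, 7), 3), ((8, 8), 5), ((9, 9), 1),
    ((10, 14), 26), ((15, 19), 10),
]
LATIN_II_FAILED = 28


def total_papers(tally):
    """Number of papers in a Latin II style tally."""
    return sum(count for _, count in tally)


def papers_reduced_at_least(tally, points):
    """Papers whose reduction is certainly at least the given number of points."""
    return sum(count for (low, _high), count in tally if low >= points)


english_total = sum(ENGLISH_GRAMMAR.values())
latin_total = total_papers(LATIN_II)
latin_reduced_10_plus = papers_reduced_at_least(LATIN_II, 10)

print(english_total, ENGLISH_GRAMMAR[0] / english_total)
print(latin_total, latin_reduced_10_plus / latin_total, LATIN_II_FAILED / latin_total)
```

The tallies come to the seventeen and sixty-one papers stated, and thirty-six of the sixty-one Latin II papers are seen to have been reduced by 10 points or more.
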

Another illustration seems worthy of note. Among the alge- 
bra papers, which proved on the whole subject to the least varia- 
tion of any, one school sent in twenty-two and had all but six 
of them reduced by 10 points or more, with five of them reduced 
by 20 or more. 
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Scores of other anomalies in grading can be seen at a glance 
from these separate distributions by schools, but the one impres- 
sion left by them all is that judgments as to the worth of exami- 
nation papers such as are written to-day are too variable to per- 
mit of substantial justice in making awards of things of such 
supreme worth to students, by means of such judgments. 

The second question which this study was devised to answer, 
namely, "what is the distribution of teachers' marks on papers 
rated at 75 by the regents?" is in a sense a detail of the more 
inclusive study of differences above. However, it seems a little 
more definite, and certainly reveals some few additional facts. 
Then, too, a reduction from 100 to 75 as a mark on a paper seems 
a greater reduction than from 85 to 60. It seemed desirable 
also to use this more exact form of differences in comparing large 
with small schools. 

The following distributions, Table 33, need but little comment. 
The total of the thirty-six small schools is given first, then the 
total of the five large schools. All the papers in English, mathe- 
matics, Latin, and science in both groups of schools were used, 
and the German papers were added to the group of five large high 
schools to make the numbers in the two groups more nearly 
alike. It must be noted that the regents did not reexamine all 
papers from the large schools. Approximately, the following 
omissions are correct: 

Four fifths of science papers from School 1. 
Four fifths of mathematics papers from Schools 3 and 5. 
Two thirds of German papers from Schools 3, 4, and 5. 
One third of Latin papers from School 3. 

It must be noted also that in making the theoretical distribu- 
tions of the large groups at 75 to take into account the custom 
of changing few or no marks less than 3 points, the group at 
75 in the case of the 5 schools was not large enough to smooth 
out the surface at 76 and 77 alone, hence no changes were made 
from the actual figures in the other steps. 
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TABLE 33 

Distribution of Teachers' Marks on all Papers in English, Latin, 
Mathematics, and Science, which Were Marked 75 by the Regents; 
Thirty-Six Small Schools, and Five Large Schools 
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Examining first the distribution and corresponding surface of 
frequency, Table 33 and Figure 9, of the marks for the thirty- 
six schools, we note that the median mark is at 75 in the actual 
distribution, and at 77 in the theoretical distribution. The 
middle 50 per cent lie between 74 and 82. Thus it appears that 
25 per cent of the papers marked 75 by the regents had been rated 
at 82 or above by the teachers. One paper had been rated at 
100, another at 60, while enough were rated at 80, 85 and 90 to 
show modal tendencies for those points. 
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Fig. 9. Surfaces of frequency of teachers' marks on papers marked 75 by 
the regents. For 36 schools above; for 5 large schools below. 



In the case of the five large schools we find the median mark 
at 80, with the middle 50 per cent extending from 75 to 85. Thus 
one fourth of the papers were reduced by 10 points or more, while 
the median paper was reduced by 5. If it were not for the modal 
points at 75, 80, 85, and 90, the marks would seem spread with 
remarkable indifference over a space of nearly 20 points. On 
the whole, the large schools present a decidedly less close group- 
ing than do the small schools. When we consider that these 
large schools probably employ only teachers of special training 
and successful experience, we are forced to either of two conclu- 
sions, that training and experience do not lead to familiarity 
with the standards of the regents, or that "familiarity breeds 
contempt." 
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In the light of the results shown here, we feel surprise that the 
regents should have found cause for reexamining so many
sets of papers from these schools only in part. If these schools
are reaching approximately satisfactory standards in marking 
papers, then we must conclude that the regents do not find fault 
with a practically rectangular distribution of marks from 72 to 
90 on papers of equal merit as judged by their own examiners. 
Or is it possible that the extent of the differences between their 
own marks and those of the teachers has not come to their 
attention? In any case, the situation does not seem very satis- 
factory. 

To anticipate the possible criticism that these figures are not 
very reliable on account of the smallness of the numbers of papers 
which enter into the computations, I have determined mathe- 
matically, from formulae in common use, 1 the extent of unrelia- 
bility of the medians and median deviations from the medians, 
in both distributions, the one for thirty-six schools, and the one
for five schools. In the case of the distributions for thirty-six 
schools the formula gives a mean square deviation of the diver- 
gence of the true median from the obtained median (in this case 
77) of .348. By this we are assured that if marks from all the 
schools of the state of which these thirty-six are typical had been 
secured, the chances are more than two to one that the median 
of the total distribution would not differ from the median of the 
present distribution by more than .348 points. The chances are 
more than two hundred to one that the true median is not more 
than 78, nor less than 76. 
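
The odds quoted here can be checked on the assumption, which is mine and is the usual one with measures of this kind, that the divergence of the obtained median from the true median follows the normal curve with the mean square deviation stated. The sketch below is not the formula cited from Thorndike; the function names are illustrative.

```python
import math


def chance_within(limit, sd):
    """Probability that a normally distributed divergence lies within +/- limit."""
    return math.erf(limit / (sd * math.sqrt(2)))


def odds_in_favor(probability):
    """Convert a probability into odds of the form 'so many to one'."""
    return probability / (1 - probability)


SD_MEDIAN_36 = 0.348  # mean square deviation reported for the thirty-six schools

# Odds that the true median lies within .348 of the obtained median of 77.
odds_within_sd = odds_in_favor(chance_within(SD_MEDIAN_36, SD_MEDIAN_36))

# Odds that the true median lies between 76 and 78, that is within one point of 77.
odds_within_one_point = odds_in_favor(chance_within(1.0, SD_MEDIAN_36))

print(round(odds_within_sd, 2))
print(round(odds_within_one_point))
```

Under that assumption the first odds come out a little above two to one and the second above two hundred to one, in agreement with the statements in the text.
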

Again in the case of the thirty-six schools, it is found that the 
mean square deviation of the divergence of the true median devia- 
tion from the obtained median deviation (in this case 3.8) is .22. 
We know from this that if all the small high schools had been 
used, the middle 50 per cent of cases would not in more than one 
chance in three have been increased or decreased by more than 
.44; nor in one chance in ten thousand would it be found to equal
that of the five schools. 

1 Thorndike, Theory of Mental and Social Measurements, page 195.

Applying the same formula for unreliability to the measures
found for the five schools, we find that the mean square deviation
of the divergence of the true median from the obtained median
(in this case 80) is .52. Thus there is less than one chance in
three that the median mark here found is less or greater than the
true median for all such schools by as much as .52 of a step. The
chances are, indeed, less than one in fifty thousand that if all
such schools had been used the median mark would have been
found as low as 77. By the formula for the unreliability of a
difference between two central tendencies 1 it can be shown that it
is out of the range of possibility that the medians for the two
distributions would be changed enough by the use of larger
numbers of schools to become equal.

Similarly determined, it can be stated that there is only one 
chance in three that the middle 50 per cent of the complete dis- 
tribution for the five large schools would be found to include 
less than 9.34 or more than 10.66 steps. 
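
The comparison of the two medians may be sketched in the same way. The following assumes normally distributed divergences and takes the unreliability of a difference to be the square root of the sum of the squares of the two separate unreliabilities, which is the usual form; whether this is the exact expression of the formula cited is an assumption on my part, and the names used are illustrative.

```python
import math


def upper_tail(z):
    """Probability that a standard normal deviate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def sd_of_difference(sd_a, sd_b):
    """Unreliability of a difference between two independent measures (assumed form)."""
    return math.sqrt(sd_a ** 2 + sd_b ** 2)


MEDIAN_36, SD_MEDIAN_36 = 77, 0.348  # thirty-six small schools
MEDIAN_5, SD_MEDIAN_5 = 80, 0.52     # five large schools

# Chance that the true median for large schools is as low as 77.
z_as_low_as_77 = (MEDIAN_5 - MEDIAN_36) / SD_MEDIAN_5
chance_as_low_as_77 = upper_tail(z_as_low_as_77)

# The difference between the two medians, measured in units of its own unreliability.
sd_diff = sd_of_difference(SD_MEDIAN_36, SD_MEDIAN_5)
z_diff = (MEDIAN_5 - MEDIAN_36) / sd_diff

print(chance_as_low_as_77 < 1 / 50000)
print(round(sd_diff, 3), round(z_diff, 2))
```

On these assumptions the chance first named is far below one in fifty thousand, and the difference of 3 points between the medians is between four and five times its own unreliability.
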

Thus the measures found are seen to be quite reliable so far 
as the number of cases used is concerned. There may be some 
question still as to whether these five schools (a number too small 
to be considered a fair random selection) may not be an unfair 
representation of the large schools of the state by the very reason 
that they are still among those whose papers are reexamined by 
the regents. This cannot be answered with certainty, since we 
cannot compare the differences of ratings of these schools with 
those which the regents do not reexamine. There are two con- 
siderations to offer in this connection, however. One is that 
since I was entirely strange to the conditions in all the high schools 
chosen as well as all other New York high schools, practically, 
no prejudice could have entered into the case if I had possessed 
any. The other and far weightier consideration is that in the 
case of four of the five schools certain sets of papers were looked 
over only in part, the inference being that the rating was being 
found satisfactory, and, therefore, the teachers' marks could be 
accepted for the remainders of the sets of papers. Thus it 
appears that these schools share in the confidence of the regents 
and are surely not a blacklisted lot which I happened to select. 
Furthermore it develops upon inspection that the one of these 
schools which had none of its papers exempted from reexamina- 
tion made the best showing in the distributions of differences. 



1 Thorndike, Theory of Mental and Social Measurements, page 193.
Notice the following table, No. 34, of medians and limits of middle
50 per cents.

TABLE 34 

Teachers' Marks on All Papers in English, Latin, German, Mathe- 
matics, and Science Marked 75 by the Regents. From the Distri- 
butions from Five Large High Schools Taken Separately 

                Median Mark     Limits of Middle 50%

School I            80              75 to 84
School II           75              73 to 81
School III          82              79 to 87
School IV           80              75 to 85
School V            82              76 to 86

If grading approximating that of the regents is the thing de- 
sired, it seems singular that School II should be the one school of 
these five to have no papers exempted. School III, though its 
record is the worst of the group, so far as the regents' marks set 
the standard, had papers exempted from reexamination in Latin, 
German, and mathematics. Presumably, too, the decision not 
to reexamine the remainder of the sets of papers from this school 
was reached by the reexamination of those which go to make up 
the major part of the record here made. 
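
The standing of the five schools in Table 34 may be expressed as the excess of each school's median mark over the regents' mark of 75. This is a minimal sketch with illustrative names; the figures are those of the table.

```python
REGENTS_MARK = 75

# School -> (median teacher's mark, lower limit, upper limit of the middle 50 per cent).
TABLE_34 = {
    "I": (80, 75, 84),
    "II": (75, 73, 81),
    "III": (82, 79, 87),
    "IV": (80, 75, 85),
    "V": (82, 76, 86),
}


def median_excess(school):
    """Points by which the median teacher's mark exceeds the regents' mark of 75."""
    median, _low, _high = TABLE_34[school]
    return median - REGENTS_MARK


def middle_half_width(school):
    """Width in points of the middle 50 per cent of teachers' marks."""
    _median, low, high = TABLE_34[school]
    return high - low


closest_school = min(TABLE_34, key=median_excess)

for school in TABLE_34:
    print(school, median_excess(school), middle_half_width(school))
print(closest_school)
```

School II alone shows no excess at the median, while School III shows an excess of 7 points with its whole middle 50 per cent lying above 75.
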

It seems, then, that so far as our meager evidence goes, these 
schools are typical of all the large schools, including those which 
are even more largely exempted. 

One further type of comparison seems worth while. If the 
large schools have advantages surpassing the small schools in 
any subjects, they are the sciences, perhaps, which require for 
their proper study expensive equipment and laboratories which 
are seldom furnished in small schools. Since we had made a 
distribution of differences between teachers' marks and regents' 
in physics for the small schools, it seemed fitting to make a simi- 
lar distribution of the same subject for the large schools. Since 
School V was largely exempted in this subject, we used only 
Schools I, II, III, and IV. In Table 35 the distribution from the
four large schools is placed side by side with that from the thirty- 
six small schools as it appeared in Table 30. The distributions 
of failed papers with the marks which the teachers had given 
them are also given, Table 36, page 79. 

TABLE 35

Distribution of Differences Between Teachers' Marks and Regents'
Marks on the Same Papers in Physics; Thirty-Six Small Schools
and Four Large Schools Listed Separately

These tables need little comment. In the distribution of
differences it will be seen that the large schools make the poorer
record so far as tallying with the regents is concerned. Of the
263 physics papers from the small schools 69 or 26.3 per cent
were marked failed by the regents. Of the 160 papers from
the large schools 68 or 42.5 per cent were marked failed by
the regents. And on the papers which were marked failed
the teachers had given about the same range of marks, the small
schools faring a little the worse. On the whole, physics does not
serve to strengthen the case of the large schools.

TABLE 36

Distribution of Teachers' Marks on Papers Marked Failed by the
Regents in Physics; Thirty-Six Small Schools and Four Large
Schools Taken Separately

Only one other question remains of those which the study was
undertaken to answer: What is the fate of papers marked near
the failing point by the teachers when they come into the hands
of the regents? To answer this question the following method was
used: All papers which were marked 60, 61, 62, 63, 64, or 65
by the teachers were taken. The departments of English,
mathematics, Latin and science were used, no papers in them
being omitted. The same thirty-six schools constituted the list.
The table of distributions follows in Table 37, page 80, and the
facts are represented in Figure 10.

It will be observed that of these low papers, the regents fail 
41.3 per cent. In the case of Latin they fail more than half, 
while in science they fail more than three fifths. Only 5.64 
per cent of the marks are raised above 65, while more than 
one half of those left within the original range of marks, 60 to 65, 
are found at 60. 
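
The proportions named in this paragraph follow from the counts of Table 37. The sketch below recomputes the share failed in each department and the share of the papers left between 60 and 65 which stand at exactly 60; the names are illustrative, and the counts are those of the table.

```python
# Department -> (papers failed by the regents, all papers marked 60 to 65 by the teachers).
TABLE_37_FAILURES = {
    "English": (176, 538),
    "Latin": (151, 280),
    "Mathematics": (114, 339),
    "Science": (119, 196),
}

# Regents' mark -> total papers of all departments left at that mark within 60 to 65.
TOTAL_LEFT_IN_RANGE = {60: 374, 61: 55, 62: 62, 63: 61, 64: 73, 65: 77}


def share_failed(department):
    """Fraction of a department's low papers which the regents failed."""
    failed, papers = TABLE_37_FAILURES[department]
    return failed / papers


total_failed = sum(failed for failed, _ in TABLE_37_FAILURES.values())
total_papers = sum(papers for _, papers in TABLE_37_FAILURES.values())
share_at_60 = TOTAL_LEFT_IN_RANGE[60] / sum(TOTAL_LEFT_IN_RANGE.values())

for department in TABLE_37_FAILURES:
    print(department, round(100 * share_failed(department), 1))
print(total_failed, total_papers, round(100 * share_at_60, 1))
```

The Latin share is above one half, the science share above three fifths, and more than half of the papers left within the original range stand at 60.
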

The chief interest in this table, No. 37, lies in the report common 
among high school teachers that they push up the grade 
on doubtful papers to "take a chance" on their passing. From 
this table one would judge that the report is true. Being true 
it is indicative of an attitude on the part of the teachers toward 
the regents' examinations which is not altogether good. If the 
teachers felt the confidence in the examinations which should 
be felt, they would give them in the same spirit in which teachers 
elsewhere give examinations of their own making. If weakness
to the extent of that indicated by a grade of below 60 were
revealed in the examination, the teacher would be glad to take
that information at its face value, and fail the paper. But
where the spirit grows up that makes the aim of the teachers
to get as many students "through the regents'" as possible,
then the examinations have lost their chief value. The battle
then becomes one between the regents and the teachers, the
one taking every possible precaution that no student "gets
through" who does not deserve to, and the other using every
device to enable the students to pull through. Instead of a
device welcomed by the teachers to measure their work by,
the examinations have become the goal, and the passing of them,
the victory sought by the students.

TABLE 37

Distribution of Regents' Marks on Papers Which the Teachers Had
Marked at 60, 61, 62, 63, 64, and 65, all Lumped Together. From
Thirty-Six Schools

Regents'    English       Latin     Mathematics    Science       Total
Marks       No.     %    No.     %    No.     %    No.     %    No.     %

Failure     176  32.7    151  54.1    114  33.6    119  60.7    560  41.3
60          181  33.6     82  29.3     97  28.7     14   7.1    374  27.6
61           27   5.0      9   3.2     14   4.1      5   2.6     55   4.0
62           25   4.6      9   3.2     23   6.8      5   2.6     62   4.6
63           27   5.0      7   2.5     16   4.7     11   5.6     61   4.5
64           36   6.7      4   1.5     20   5.9     13   6.6     73   5.4
65           30   5.6     15   5.4     15   4.4     17   8.7     77   5.7
66            6   1.1                   6   1.7      2   1.0     14   1.0
67            5    .9                   3    .9                   8    .6
68            6   1.1                   4   1.2                  10    .7
69            3    .6      2    .7      7   2.0      5   2.4     17   1.3
70           10   1.9      1    .4     18   5.3      2   1.0     31   2.3
71                                      1    .3      1    .5      2    .1
72
73                                                   1    .5      1   .07
74
75            5    .9                   1    .3      1    .5      7    .5
76            1    .2                                             1   .07

Totals      538          280          339          196         1353

This table reads as follows: Of the 538 English papers which the teachers
had marked from 60 to 65 inclusive, the regents marked 176 as failed; 181 at
60; 27 at 61, etc.

In order to discover whether the failing of low papers was 
confined to a few schools, or whether it was well distributed 
over the whole lot, Table 38, page 82, indicating failures
by schools, first in various subjects and then in totals, was 
compiled. 
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From this it is plain that the disposition to send low papers 
having generous ratings to the regents to "save as many as
possible" is pretty well distributed among the schools. The
lowest percentage saved by any school was 22, that for School 8,
while the highest percentage saved was 75, that for School 29.

Fig. 10. Surface of frequency of regents' marks on papers marked 60 to
65 inclusive by the teachers. "F" means failed.
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TABLE 38 

Totals by Schools of Papers Marked Between 60 and 65 Inclusive 
by Teachers and the Numbers of Those Papers Failed by the 
Regents 
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This table reads as follows: Of the 39 English papers from School 1 which the teachers had marked from 60 to 
65 inclusive, the regents failed 10; and of the total 71 low papers which the teachers sent in from School 1 
the regents failed 41, or 57 per cent. 

The average saved to the schools from among these low papers 
was found to be 58.7 per cent. 

But few features of this table call for comment. In the case 
of School 20, all of the seventeen Latin papers were failed, a 
record but little poorer than that made by School 1. School 10 
lost none of the ten mathematics and lost all but one of the ten 
science papers. The relatively high record made by the English 
teachers in grading to suit the regents is a little surprising, 
since the impression prevails that there is less chance of establish- 
ing uniform standards understandable by all the teachers in 
English than in most other subjects. The result in that 
particular here coincides with the result in the totals of the 
1912 examinations reported several pages back. 

By all these findings concerning the New York State system 
of examinations, we are compelled to conclude that the type of 
examination now in common use is not a successful means of 
standardizing school achievement. 

Through the interest and cooperation of Superintendent Muir 
and his teachers at Orange, N. J., I was able to get the data 
for the following brief experiment in marking. There were two 
aims in mind in gathering the data: first, to find the extent of 
differences in rating elementary school papers by the teachers 
in the grades, and second, to determine the extent of reduction 
of these differences which would be accomplished by having the 
several teachers follow a uniform standard of values for the 
different parts of each question. To this end, Superintendent 
Muir had all of his fifth grade teachers give a uniform arithmetic 
test to their pupils, rate the papers, and send them to him with- 
out any marks upon them. When the papers were thus assem- 
bled he asked one of the teachers who is unusually systematic 
in arithmetic work to make out an appropriate scheme for the 
marking of the papers, which should be simple and yet should 
take account of the various processes involved in the several 
problems. When this was done, a substitute was provided in 
this teacher's room, and she was asked to rate all the papers 
by the scheme she had provided. Afterwards, the teachers of 
the several rooms were asked to rate their own papers again 
by using this teacher's scheme of marking. Thus each paper 
was rated three times. The questions which I desired to inves- 
tigate could be answered by comparing each teacher's ratings 
made by her own method and by the systematic method with 
the ratings of the special teacher (called judge hereafter). These 
comparisons are given in Table 38a, page 84, where distribu- 
tions of differences between the teacher's mark and the judge's 
mark on the same papers are given for each of the six teachers 
who did both ratings. 

From Table 38a it will be observed that there is a very consid- 
erable range of differences when teachers use their own standards 
of marking, there being one fourth of the cases where the judge's 
mark is greater by 3 points or more, and another fourth of the 
cases where the teacher's mark is greater by 5 points or more. 

TABLE 38a 

The Distributions of Differences Between Two Teachers' Marks on 
Sets of Fifth Grade Arithmetic Papers, First Without Any Effort 
to Unify the Methods Used, and Second by a Common Standard 

When the same standard of rating is used by both teacher and 
judge the range of differences is very much reduced, considerably 
more than half the cases being 0. Individual differences among 
teachers appear plainly in the medians at the bottom of the 
columns. While teacher D made a median mark 6 points higher 
than the judge, teacher F made a median mark 4 points lower 
than the judge. As measured by the standard of this judge, the 
teachers differed by 10 points as to the value of equivalent papers. 
From this brief experiment we may draw one lesson: If the 
superintendent expects to place much significance upon the 
uniform tests which he gives he must either have the marking 
done by a single judge, or else must make out a scale for the 
rating of the papers by which the variations of the several 
teachers may be greatly reduced. 
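
The comparison made in this experiment can be sketched in a few lines of Python. This is only an illustration: the marks below are invented and are not the Orange data, and the helper names `differences` and `summarize` are hypothetical.

```python
from statistics import median

def differences(teacher_marks, judge_marks):
    """Teacher's mark minus the judge's mark, paper by paper."""
    return [t - j for t, j in zip(teacher_marks, judge_marks)]

def summarize(diffs):
    """Median difference and the share of papers on which both agree exactly."""
    zero_share = sum(1 for d in diffs if d == 0) / len(diffs)
    return median(diffs), zero_share

# Invented marks for six papers (assumption, for illustration only).
judge = [70, 82, 65, 90, 55, 78]
own_standard = [76, 85, 60, 96, 63, 80]   # teacher marking by her own method
common_scheme = [70, 83, 65, 90, 54, 78]  # same teacher using the judge's scheme

print(summarize(differences(own_standard, judge)))
print(summarize(differences(common_scheme, judge)))
```

With a common scheme the differences in this toy case cluster at zero, which is the pattern the experiment reports for the actual papers.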



STANDARD TESTS AND SCALES AS AIDS IN STAND- 
ARDIZATION 

As illustrations of the means being advocated during recent 
years for overcoming in part this variability of standards among 
teachers we may mention the following: for arithmetic, Stone¹ 
and Courtis²; for handwriting, Ayres³ and Thorndike⁴; for 
composition, Hillegas⁵; for drawing, Thorndike⁶; for reading, writing 
and composition together, Courtis⁷; and for spelling, 
Buckingham.⁸ It has been impossible for me to examine all of these. 
In fact, it has been impossible for me to examine any of them 
exhaustively. I shall, however, submit some data concerning 
Courtis's arithmetic tests, Thorndike's drawing and writing 
scales, and Hillegas's composition scale. These data will be 
presented as far as possible in such a way that they will be of 
service to anyone who wishes to carry the study further. 

These standard measures are of two distinct types. The 
first, illustrated by the Courtis arithmetic tests, is a special test 
so devised that the rating of the results is wholly objective, 
and practically all variability among markers is, therefore, 
eliminated. The other type is designed to define merit in the 
ordinary productions of the pupils. By their use it is expected 
that the same paper will be given more nearly the same mark 
by several judges than would be the case without the scales. 
The Hillegas and Thorndike scales are examples of this type. 

1 C. W. Stone, Arithmetical Abilities and Some Factors Determining Them, 
Teachers College Contributions to Education, No. 19. 

2 S. A. Courtis, Standard Tests in Arithmetic, 82 Eliot St., Detroit, Mich. 

3 L. P. Ayres, Scale for the Measuring of Quality of Handwriting in 
Children, Russell Sage Foundation, Publication No. 113. 

4 E. L. Thorndike, Handwriting, Teachers College Record, March, 1910. 

5 M. B. Hillegas, Standard for Measuring the Quality of English 
Composition by Young People, Teachers College Record, Sept., 1912. 

6 E. L. Thorndike, The Measurement of Achievement in Drawing, 
Teachers College Record, Nov., 1913. 

7 S. A. Courtis, Standard Tests in Reading, Writing and Composition, 82 
Eliot St., Detroit, Mich. 

8 B. R. Buckingham, Spelling Ability, Its Measurement and Distribution, 
Teachers College Contributions to Education, No. 59. 

I shall omit any extended account either of the nature or origin of the tests 
and scales to be discussed in this section because I assume that anyone 
interested in the discussion will be familiar with them. It is difficult to do justice to 
them with any brief description. All of them are available in their complete 
form as indicated by the above addresses of publishers. 

Each type has its advantages and its shortcomings. In the 
case of the special test there is greater definiteness, and less varia- 
tion among the judges, but it is narrower in scope and involves 
a great amount of care and labor in its preparation and admin- 
istration. In addition, there is doubtful value in the continued 
use of the same test with the same children. In the case of 
the standard scales the results are less precise because more 
subjective but can be applied to the specimens of the regular 
work of the children. Also they increase in helpfulness with 
time and repeated use. 

In the following discussion of the standard tests or scales 
for measurement, it must be kept in mind that our chief interest 
in this study is the establishment in the minds of the teachers 
of a uniformity of standards such that the injustices which surely 
follow from the variability pointed out in the preceding sections 
may be materially reduced. The data will be available for 
further study of other phases of the tests and scales, but we shall 
be primarily concerned with their serviceableness as instruments 
for the establishing of uniform standards in the minds of teachers 
by which variability of rating a given degree of merit can be 
reduced. 

I. The Courtis Tests in Arithmetic 

The above limitation of my purpose makes unnecessary any- 
thing but the briefest sketch of my findings in regard to the 
Courtis arithmetic tests because by them the rating becomes a 
mechanical process subject to almost no variation. Upon only 
one basis can we properly inquire into the effectiveness of the 
Courtis tests with reference to their soundness as a means of 
measuring merit, and that is the basis of the material which 
is selected as an index of the ability which the author seeks to 
measure. This I shall consider very briefly. 

Two fellow students, Mr. P. P. Brainard and Mr. R. L. Mc- 
Laughlin, were associated with me in a study of the Courtis 
tests in their application to the schools of Hackensack, N. J. 
Under the direction of Superintendent Stark we assisted in 
the administering of the tests, and we did practically all of the 
calculations by which the results were made of service to the 
superintendent and his teachers. By means of this test it was 
possible to make very definite statements regarding each of the 
eight sorts of arithmetical work called for in the test. Since 
the superintendent had given the same test four months earlier, 
statements of progress as well as condition were possible by 
comparisons with previous records of individuals, rooms, build- 
ings, and the system as a whole. It seems to me beyond ques- 
tion that such information is of great value to the school system 
of Hackensack. The question which is related to my purpose 
is whether the abilities upon which the test enabled us to report, 
are the abilities which we are trying to measure by means of 
the tests. To be specific, is the ability to do single combinations 
in addition, subtraction, multiplication and division a good 
indication that the person can do well the long processes in the 
same fundamentals? If not, then by establishing standards in 
the single combinations we are using a false index of ability, 
because of course the ability which we wish developed is that 
by which success in the long processes is achieved. 

To determine whether there is a close correlation between 
facility with single combinations and with long processes in 
the fundamental operations of arithmetic, I calculated Pearson 
coefficients to indicate this correlation between the sum of a 
pupil's scores in tests 1 to 4 (single combinations tests) and his 
score in "rights" of test 7 (the abstract examples involving 
all of the four fundamental operations). I used six groups of 
children from different grades, selecting approximately fifty 
papers at random from each larger group. (The means which 
I used to assure random selecting was to take the papers just 
as they came in the pile.) The coefficients were found to be 
as follows: 

R's Between Results of Courtis Tests 1 to 4, and Test 7 Rights 

4B grade                            .028 
5B grade                            .20 
6B grade                            .10 
7B and 7A (Academic Course)         .34 
7B and 7A (Commercial Course)      -.015 
8B (Commercial Course)              .41 

Average                             .177 

It thus appears that facility in long processes is not dependent 
primarily upon facility in single combinations. From this we 
may conclude that it is ill-advised to try to standardize the work 
in single combinations, especially in the upper grades. Its 
significance has yet to be established even in the lower grades. 
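
For readers who wish to repeat the calculation, the following is a minimal sketch of the Pearson coefficient in plain Python, together with a check of the average of the six coefficients. The coefficients are read here as .028, .20, .10, .34, -.015 and .41; the function name `pearson_r` is an assumption of this sketch, not something taken from the study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment coefficient between two paired lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# The six coefficients as read from the list above.
reported = [0.028, 0.20, 0.10, 0.34, -0.015, 0.41]
average_r = sum(reported) / len(reported)
print(round(average_r, 3))
```

In use, `xs` would hold each pupil's summed score on tests 1 to 4 and `ys` the same pupil's "rights" on test 7, one pair per paper.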

In this connection I wish to call attention to what seems to me 
a significant fallacy in an article by Courtis in the March, 1913, 
Elementary School Teacher. He there states that a very high 
correlation exists between tests 1 to 4 and test 7 attempts ("Pear- 
son coefficient of correlation of .98"), and a "slightly lower" 
correlation between tests 1 to 4 and test 7 rights. He uses score 
sheets from 55,200 children as a basis for his computation and 
naturally his statement carries great weight. The obvious 
corollary to it is that the teacher who gets the best results in 
the single combinations is producing the greatest facility in prac- 
tical processes in the fundamentals. It is, therefore, a matter 
of consequence. 

In Courtis's table he divides his 55,200 children into 45 groups, 
and records with each group its average in tests 1 to 4, and its 
average in test 7, both attempts and rights, in separate columns. 
These columns of averages are the bases of his correlation. His 
groups, although he does not tell us their origin, are presumably 
class groups, the lowest being the third grades of some city, the 
next being the group of the first step higher, and so on up to the 
best twelfth grade at the other end of the series. It is, then, 
not surprising that the lowest group should be lowest in both 
tests 1 to 4, and test 7, nor that the highest group should be 
highest in both tests. That is just what we should expect 
whether there is any correlation between the two abilities tested 
or not. Both things are taught in school, and as children advance 
in years, they become more efficient in both processes, on the 
average. Even if there is no correlation between the two 
abilities in individual children, we should expect to see them improve, 
on the average, side by side, just as improvement in either one 
would correlate with physical growth. I contend, then, that 
Courtis's discovery of a high correlation is no indication of corre- 
spondence between the two abilities in individuals, and therefore 
constitutes a mistaken doctrine which it is injurious to advance. 
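
The argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of Courtis's data: the 45 groups, the 200 pupils per group, the score levels and the spread are all invented. Within every group the two abilities are generated independently, yet both rise with the grade.

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson coefficient between two paired lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate(groups=45, pupils=200, seed=0):
    """Return (r between group averages, mean r between pupils inside a group)."""
    rng = random.Random(seed)
    group_mean_a, group_mean_b, within_r = [], [], []
    for g in range(groups):
        level = 10 + g  # both abilities improve from group to group
        a = [level + rng.gauss(0, 5) for _ in range(pupils)]  # e.g. tests 1 to 4
        b = [level + rng.gauss(0, 5) for _ in range(pupils)]  # e.g. test 7, independent of a
        group_mean_a.append(mean(a))
        group_mean_b.append(mean(b))
        within_r.append(pearson_r(a, b))
    return pearson_r(group_mean_a, group_mean_b), mean(within_r)

between, within = simulate()
print(round(between, 2), round(within, 2))
```

The coefficient computed from group averages comes out very high, while the coefficient among pupils of the same group stays near zero, which is the distinction drawn above.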

The above error accompanies, if it does not originate in, 
an attempt to devise a test suitable for all grades alike. The 
use of tests 1 to 4 in the upper grades, anywhere above the fifth, 
cannot be justified. The same attempt has led to another 
mistake, it seems to me, which I shall mention. The use of test 
8, the two step reasoning test, in grades below the fifth, or 
perhaps the sixth, is hard to defend. According to Courtis's 
published averages, the achievement for many thousands of children 
indicates the following: 

                  Rights, Test 8 
Third grade            .6 
Fourth grade           .8 
Fifth grade           1.2 

For securing even as high figures as these the provision in his 
method of calculation whereby each child getting none correct 
is credited with .5 of one, and each child getting one, is credited 
with 1.5 and so on, is largely responsible. Thus in the third 
grade, out of a hundred children probably 90 get none right. 
It is absurd to suppose that these ninety averaged a half one 
right. In fact, probably the majority can make absolutely 
nothing out of the jumble of words which constitute the problem. 
The situation is little better in the fourth grade, and it is surely 
vain to try to standardize such processes where achievement is 
so low. Rather let us abandon the notion of a uniform test 
for all grades and adapt the test which we do give to the age of 
the pupils who are to take it. It is probably something of the 
same thought which has prompted Courtis to publish recently 
his separate sheets for testing fundamentals. 
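
The effect of the half-point credit can be worked through numerically. The sketch below assumes, purely for illustration, a third grade of one hundred children in which ninety get none right and the remaining ten get exactly one right; only the figure of ninety comes from the text, and the helper name `averages` is hypothetical.

```python
def averages(children_by_rights):
    """Return (raw average, average after crediting each child an extra .5)."""
    total = sum(children_by_rights.values())
    raw = sum(rights * n for rights, n in children_by_rights.items()) / total
    credited = sum((rights + 0.5) * n for rights, n in children_by_rights.items()) / total
    return raw, credited

# Assumed split: 90 children with 0 right, 10 children with 1 right.
third_grade = {0: 90, 1: 10}
raw, credited = averages(third_grade)
print(raw, credited)
```

Under these assumptions the credited average is .6, the published third grade figure, while the children actually average only one tenth of a problem right.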

II. The Thorndike Drawing Scale 

In our examination of the scales for measurement of regular 
school products, we shall consider first the Thorndike Drawing 
Scale because tradition has as yet done less to fix a standard 
of any sort for drawing than for most other school subjects. 
The value of the scale can be the more readily pointed out on 
that account. 

As a basis for the study of the drawing scale a set of thirteen 
drawings were rated by from twenty-five to thirty-five teachers 
by both methods, the ordinary percentage method, and with 
the scale. Professor Hillegas very kindly permitted the rating 
to be done by his advanced class in Current Problems in Ele- 
mentary Education during one of his class hours. The samples 
of drawing were those which Professor Thorndike has had printed 
on heavy paper for purposes of experimentation and perfection 
of the scale. It was possible thus to have a copy of the drawings 
in the hands of all the teachers at one time. The percentage 
rating was made first. They were instructed to rate the drawings 
first as they would if they were fourth grade teachers, and the 
drawings had been done by children in their class. They then 
repeated the rating supposing the drawings to be done by sixth 
grade children. Next they were to consider them as eighth grade 
productions, then as tenth grade, and finally as senior high 
school productions. Thus every teacher who finished the task 
made sixty-five judgments in all. 

After this was finished, the records were taken up and the 
scale for measuring drawing, consisting of fourteen drawings of 
stated values as determined statistically, was handed to each 
teacher and the request made that the thirteen drawings be 
again rated, this time by giving to each one the value which was 
assigned on the scale to the drawing most nearly equal to it in 
merit. Thus we secured the judgment of each of from thirty- 
four to thirty-six teachers on these thirteen drawings by means 
of the scale. Our question concerns itself mainly with a com- 
parison of the two groups of data thus secured. 

The distributions of the judgments by the percentage scale 
are given in the five parts of Table 39, and the distributions of 
judgments by the Thorndike scale are given in Table 40. With 
each division of the tables are given the average of all the judg- 
ments made and the average deviation of the judgments from 
that average. At the lower right hand corner of each table is 
given the average of the thirteen average deviations found for 
that table. The drawings are numbered, the numbers being 
the same as designate the drawings on the sheet prepared by 
Professor Thorndike for experimentation. 
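
As a rough sketch of how an average and an average deviation are obtained from a grouped distribution of marks, consider the Python below. The tally is invented, each judgment is assumed to sit at the mid-point of its step, and deviations are taken here from the true average; the function names are hypothetical.

```python
def step_midpoint(low, high):
    """Mid-point of a step of the percentage scale, e.g. 88 to 92 gives 90."""
    return (low + high) / 2

def mean_and_average_deviation(distribution):
    """distribution maps a (low, high) step to the number of judgments in it."""
    n = sum(distribution.values())
    mean = sum(step_midpoint(*step) * count for step, count in distribution.items()) / n
    ad = sum(abs(step_midpoint(*step) - mean) * count
             for step, count in distribution.items()) / n
    return mean, ad

# Invented tally of twenty judgments on one drawing (assumption).
tally = {(68, 72): 2, (78, 82): 5, (88, 92): 8, (98, 100): 5}
average, average_deviation = mean_and_average_deviation(tally)
print(average, average_deviation)
```

A wide scatter of judgments shows itself as a large average deviation even when the average itself looks reasonable.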

The wisdom of using the average instead of the median as 
the central tendency in these distributions may be questioned. 
The reason which seemed to me to justify it is that the distribu- 
tions are so wide, and often so dispersed in the middle, that the 
median would be shifted considerably away from the average, 
even though the distribution was fairly symmetrical. Of course 
the undistributed extremes of the distributions point to the 
proper use of the median, but, on the other hand, for purposes 
such as these tables are compiled, full weight should be given 
to extreme measures which are far from the central tendency. 
At any rate, probably either measure answers the purpose with 
sufficient accuracy for our use. 

TABLE 39 

Distribution of Marks Assigned to Samples of Drawings by Teachers. 
The Numbers Correspond to Those Used on the Sheet of Drawings 
Reproduced by Professor Thorndike 

I. When Considered as Fourth Grade Productions 

            No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.
Marks       118  121  123  124  125  129  130  135  139  140  145  146  153
0 to 2        -    -    2    8    -    -    -    -    -    -    -    -    -
3 to 7        -    -    1    3    -    -    -    -    1    -    1    -    -
8 to 12       -    -    -    5    -    -    1    1    1    -    -    -    -
13 to 17      -    -    -    -    -    -    -    -    -    -    -    -    -
18 to 22      -    -    -    3    1    -    -    -    1    1    -    -    -
23 to 27      1    -    2    -    2    -    -    -    1    -    -    -    -
28 to 32      -    1    3    -    -    -    -    -    2    -    -    -    1
33 to 37      -    -    -    -    -    -    -    -    -    -    -    -    -
38 to 42      -    -    -    2    -    -    1    -    2    -    -    -    2
43 to 47      -    -    -    -    -    -    1    -    -    -    -    -    -
48 to 52      1    1    2    2    1    2    2    1    3    1    -    -    1
53 to 57      -    -    -    -    -    -    -    -    -    1    -    -    -
58 to 62      -    -    1    3    1    -    1    1    -    1    1    -    1
63 to 67      -    -    3    -    -    1    1    -    -    -    -    -    -
68 to 72      -    1    7    2    1    -    -    -    4    1    2    1    1
73 to 77      1    -    2    1    3    -    4    -    3    -    1    -    3
78 to 82      -    4    3    -    3    -    5    4    3    3    3    -    4
83 to 87      3    1    1    1    2    1    1    4    1    4    4    -    3
88 to 92      9    5    2    -   12    6    5    3    4    7    3    2    5
93 to 97      7    7    -    1    2    8    3    5    -    4    8    8    4
98 to 100     9   11    2    -    3   12    4   10    3    6    6   18    4

Totals       31   31   31   31   31   30   29   29   29   29   29   29   29
Average      90   91   60   29   79   91   77   87   63   84   86   97   80
A.D.        8.5 10.0 22.5 27.0 15.0  9.0 16.5 13.0 23.5 12.5 12.0  4.0 13.5
Avg. A.D.  14.4

Note: All the A.D.'s in this table were computed not from the true 
average but from the nearest mid-point of a step. 



II. When Considered as Sixth Grade Productions 

            No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.
Marks       118  121  123  124  125  129  130  135  139  140  145  146  153
0 to 2        -    -    4   16    -    -    -    1    5    -    1    -    1
3 to 7        -    -    1    2    -    -    1    -    -    -    -    -    -
8 to 12       -    -    3    -    1    -    -    -    1    1    -    -    1
13 to 17      1    -    1    2    -    -    -    -    1    -    -    -    -
18 to 22      -    -    -    1    1    -    -    -    -    1    -    -    1
23 to 27      1    1    1    1    -    1    -    -    1    -    -    -    -
28 to 32      -    -    -    -    -    -    2    -    2    -    -    -    -
33 to 37      -    1    -    -    -    -    -    -    -    -    -    -    -
38 to 42      1    -    5    4    2    1    1    -    1    2    1    -    1
43 to 47      -    -    -    -    -    -    1    -    2    -    -    -    1
48 to 52      -    1    1    1    -    -    3    2    1    -    1    -    1
53 to 57      -    -    3    -    1    1    1    -    -    -    -    -    1
58 to 62      -    2    4    2    3    -    -    -    4    -    2    -    3
63 to 67      -    -    -    -    1    -    1    -    1    2    1    -    1
68 to 72      1    1    3    1    2    -    3    1    2    3    -    -    3
73 to 77      3    1    1    -    3    -    3    2    1    2    4    -    2
78 to 82      2    2    1    1    6    6    6    5    4    4    5    -    6
83 to 87      9    5    -    -    3    3    4    4    -    3    2    -    -
88 to 92      7    6    2    -    4    8    -    4    2    5    6    -    2
93 to 97      3    6    -    -    -    5    2    6    1    2    3   11    2
98 to 100     3    5    1    -    2    5    1    4    -    4    2    8    2

Totals       31   31   31   31   31   30   29   29   29   29   28   28   28
Averages     81   82   44   19   71   85   68   82   49   76   78   92   67
A.D.       13.0 14.0 25.0 21.5 15.5 11.0 17.0 13.5 26.5 17.0 13.5  7.5 20.0
Avg. A.D.  16.55

III. When Considered as Eighth Grade Productions 

            No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.
Marks       118  121  123  124  125  129  130  135  139  140  145  146  153
0 to 2        1    -   10   20    1    -    2    1    7    1    -    -    5
3 to 7        -    -    -    -    -    -    -    -    2    1    -    -    1
8 to 12       -    -    1    1    -    -    1    -    1    -    -    -    -
13 to 17      -    -    -    -    2    1    -    -    1    -    -    -    -
18 to 22      2    2    3    1    -    -    1    -    1    1    -    -    -
23 to 27      -    -    -    2    1    -    1    -    -    -    -    -    -
28 to 32      -    1    4    -    1    1    1    1    -    1    -    -    3
33 to 37      -    -    1    -    1    -    -    -    -    -    -    -    -
38 to 42      -    1    1    2    -    1    1    3    5    -    -    2    2
43 to 47      -    -    -    -    -    -    2    -    -    -    -    -    1
48 to 52      1    1    4    3    3    -    -    -    1    3    -    1    2
53 to 57      -    1    -    -    -    -    -    -    -    -    -    -    -
58 to 62      3    1    2    -    7    1    4    2    5    5    5    2    3
63 to 67      1    -    -    -    -    -    2    -    -    1    -    -    -
68 to 72      3    2    2    -    4    3    6    4    1    3    2    -    4
73 to 77      8    4    -    1    1    1    1    -    1    -    3    1    1
78 to 82      4    7    2    -    3    7    4    5    -    3    4    -    -
83 to 87      2    1    -    -    3    5    -    2    2    4    2    1    1
88 to 92      2    4    -    -    1    4    2    7    1    2    3    9    1
93 to 97      2    3    -    -    1    3    1    1    -    2    2    4    2
98 to 100     1    2    -    -    -    2    -    2    -    2    1    7    1

Totals       30   30   30   30   29   29   29   28   28   29   27   27   27
Averages     70   73   29   13   59   77   58   73   36   66   68   80   49
A.D.       15.0 16.0 23.0 18.0 18.0 13.5 22.0 17.5 27.0 20.5 18.5 14.0 27.0

IV. When Considered as Tenth Grade (Second Year High School) Productions 

            No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.
Marks       118  121  123  124  125  129  130  135  139  140  145  146  153

(The distribution of marks by steps, 0 to 2 through 98 to 100, is not 
legible for this part; only the summary rows can be read.) 

Totals       29   29   29   29   28   28   26   27   27   28   26   25   25
Averages     58   65   19    9   49   72   43   68   27   58   53   78   37
A.D.       18.0 21.0 19.5 13.0 21.0 16.5 23.5 22.0 25.5 24.0 [?].5 17.5 25.5
Avg. A.D.  20.8



V. When Considered as Twelfth Grade (Senior High School) Productions 

            No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.  No.
Marks       118  121  123  124  125  129  130  135  139  140  145  146  153
0 to 2        4    2   16   22    4    1    6    2   14    3    5    2    8
3 to 7        -    2    2    1    2    1    1    -    -    1    1    -    -
8 to 12       2    1    1    1    2    1    1    3    1    1    -    -    1
13 to 17      -    -    -    -    1    -    -    -    -    1    -    -    1
18 to 22      1    -    2    1    1    -    1    1    2    2    -    1    1
23 to 27      1    1    1    -    -    -    -    -    1    -    1    -    -
28 to 32      -    1    1    2    2    1    3    -    2    2    1    1    1
33 to 37      2    1    -    -    -    -    -    -    2    -    -    -    2
38 to 42      1    1    2    -    3    2    2    -    1    1    1    -    2
43 to 47      -    1    1    -    -    -    -    -    -    -    3    -    -
48 to 52      7    1    2    -    4    -    4    -    1    3    3    -    2
53 to 57      -    -    -    -    -    1    -    -    -    -    -    1    -
58 to 62      -    4    -    -    2    2    2    2    1    3    1    -    4
63 to 67      2    -    -    1    -    2    -    -    -    -    2    1    -
68 to 72      2    5    -    -    3    3    3    3    -    1    2    -    1
73 to 77      1    -    -    -    1    5    2    1    -    1    1    3    -
78 to 82      3    -    -    -    3    4    -    6    2    5    1    2    -
83 to 87      -    3    -    -    -    -    -    4    -    2    2    3    1
88 to 92      2    3    -    -    -    2    -    1    -    1    1    5    1
93 to 97      -    -    -    -    -    2    -    1    -    -    -    3    -
98 to 100     -    2    -    -    -    -    -    1    -    -    -    3    -

Totals       28   28   28   28   28   27   25   25   27   27   25   25   25
Averages     46   56   12    6   39   63   35   62   19   48   45   74   32
A.D.       24.0 26.5 14.5  8.5 24.0 19.0 24.0 27.0 20.5 26.5 25.0 21.0 25.0
Avg. A.D.  21.95

TABLE 40 

Distribution of Marks Assigned to Samples of Drawings by Teachers 
Using the Thorndike Scale. The Numbers Correspond to Those 
Used on the Sheet of Drawings Reproduced by Professor Thorndike 

             No.   No.   No.   No.   No.   No.   No.   No.   No.   No.   No.   No.   No.
Marks        118   121   123   124   125   129   130   135   139   140   145   146   153
0              -     -     -     9     -     -     -     -     -     -     -     -     -
2.4            -     -     3    15     -     -     -     1     -     -     -     -     2
3.9            -     -     2     4     -     -     1     -     1     -     -     -     1
5.7            -     -     2     1     1     -     1     -     1     -     -     -     -
6.5            -     -     8     7     -     -     1     -     3     -     -     -     1
7.8            -     1    12     -     -     1     7     -    14     3     1     -     -
8.6            -     -     8     -     2     1     2     -    10     1     2     -     6
10.5           -     -     1     -     4     2    10     1     5     2     4     -     6
11.8          11     3     -     -    11     2     3     5     -     4     3     2     3
12.6          10     7     -     -    11     2     2     3     1     9     7     -     3
13.5           7    15     -     -     4     3     5     7     -     1     6     -     1
14.4           4     5     -     -     3     9     2     7     -     5     5     2     8
16.0           3     4     -     -     -    13     -     9     -     9     6    11     2
17.0           1     1     -     -     -     3     -     2     -     -     -    19     1

Totals        36    36    36    36    36    36    34    35    35    34    34    34    34
Averages   13.03 13.58  7.08  3.15 12.13 14.07 10.58 13.89  8.17 13.00 12.94 16.47 11.08
A.D.         .99   .83  1.18   .96  1.09  1.38  1.72  1.23   .89  1.71  1.49   .88  2.47
Avg. A.D.   1.29 (in steps of scale) 

To make comparisons easy, and to indicate the extent of 
reduction in marks from fourth to sixth grade, and from sixth to 
eighth and so on, I have assembled the lists of averages in a 
separate table, No. 41, page 94. In this table the drawings are 
listed in the order of excellence as determined by the average of 
the five averages of ratings made upon them by the percentage 
method. The average by the scale method is listed also. 



TABLE 41 

Giving the Averages of the Ratings Made upon Thirteen Drawings 
by Teachers When These Drawings Were Considered in Turn, 
Fourth, Sixth, Eighth, Tenth, and Twelfth Grade Productions, 
Using the Customary Percentage Method, and the Ratings by the 
Thorndike Scale 

No. of                                                    Thorndike
Drawing    As 4th  As 6th  As 8th  As 10th  As 12th   Avg.    Scale
124          29      19      13       9        6      15.2     3.15
123          60      44      29      19       12      32.8     7.08
139          63      49      36      27       19      38.8     8.17
153          80      67      49      37       32      53.0    11.8
130          77      68      58      43       35      56.2    10.58
125          79      71      59      49       39      59.4    12.13
145          86      78      68      53       45      66.0    12.94
140          84      76      66      58       48      66.4    13.0
118          90      81      70      58       46      69.0    13.03
121          91      82      73      65       56      73.4    13.58
135          87      82      73      68       62      74.4    13.89
129          91      85      77      72       63      77.6    14.07
146          97      92      80      78       74      84.2    16.47

Average      78.1    68.8    57.8    48.9     41.3


From Table 41 it will be observed that this group of teachers 
consider that a drawing is worth about 10 points more for fourth 
grade than for sixth, and about 10 points more for sixth than for 
eighth, and so on. It will be observed further that drawing 139 
is considered about as good for fourth grade as drawing 153 is 
for sixth, and this in turn as good as drawing 140 for eighth, 
and 121 for tenth, and 129 for twelfth. The average value of 
these five drawings by the scale is seen by the right hand column 
of the table to be, in order, 8.17, 11.8, 13.0, 13.58, 14.07. These 
values indicate steps of difference between what is expected from 
the successive grades, as follows: 

From 4th to 6th grade, a gain of 2.63 units of the scale. 
From 6th to 8th grade, a gain of 1.2 units of the scale. 
From 8th to 10th grade, a gain of .58 units of the scale. 
From 10th to 12th grade, a gain of .49 units of the scale. 

Comparing this rapid decrease in the amount of gain expected 
from grade to grade as we advance in the grades, with the fact 
pointed out above that the drop in percentage from grade to 
grade is about constant, being about 9 or 10 points for each 
two-year change of grade, we have revealed one of the most con- 
spicuous defects in our present system of marking. While 
each two years are expected to add 10 points value by the per- 
centage scale, they are in fact expected to add ever-decreasing 
amounts of actual value. These points on the percentage scale 
do not have a fixed value, and they vary in the different portions 
of the scale, and under different circumstances. There being 
no absolute standard fixed, each judge attaches his own value to 
the scale. As seen in this case, there is no consistency about 
it. There is for any drawing, on the average, about a 10 point 
higher standard required from fourth to sixth to eighth, and so 
on, but the papers valued practically of equal merit for the 
successive groups differ by very unequal amounts. Even the 
averages of the percentage markings on the five drawings which 
come nearest to a rating of 65 in the five successive groups, 
stand at very unequal intervals. These averages are, respec- 
tively, 38.8, 53, 66.4, 73.4, and 77.6, indicating increases of 14.2, 
13.4, 7.0, and 4.2, respectively. 
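
The contrast drawn in this paragraph can be checked with a little arithmetic. The sketch below is only an illustration (Python and the variable names are not part of the original study); it takes the Thorndike scale values and the average percentage marks quoted above for drawings 139, 153, 140, 121, and 129, and prints the step from each grade to the next. Subtracting the quoted scale values 8.17 and 11.8 gives 3.63 for the first step rather than the 2.63 listed above; the other three steps agree with the list.

```python
# Values quoted in the text for drawings 139, 153, 140, 121, 129,
# judged about equally good for grades 4, 6, 8, 10, 12 respectively.
grades = [4, 6, 8, 10, 12]
scale_values = [8.17, 11.8, 13.0, 13.58, 14.07]   # Thorndike scale
pct_averages = [38.8, 53.0, 66.4, 73.4, 77.6]     # average percentage marks


def steps(values):
    """Differences between successive entries, rounded to two decimals."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]


scale_steps = steps(scale_values)
pct_steps = steps(pct_averages)

for (g1, g2), s, p in zip(zip(grades, grades[1:]), scale_steps, pct_steps):
    print(f"grade {g1:>2} -> {g2:>2}: {s:5.2f} scale units, {p:5.1f} percentage points")
```

Both series shrink as the grades advance, whereas the drop in the percentage mark for any one drawing stays near 10 points for each two-year change of grade.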

Before turning to the comparison of variabilities accompany- 
ing the two methods of rating, I wish to point out the evidence 
of the great diversity of standards held by the teachers. In 
order most clearly to point this out, I computed the difference 
between each teacher's judgment on each paper and the average 
of all the judgments on the same paper. For example, the 
average judgment of all teachers upon drawing 124 as a fourth 
grade production is seen to be 29. If a teacher rated it at 35 
he was credited with a plus difference of 6. Similarly all the 
differences were calculated, and then the sum of all the plus 
differences and the sum of all the minus differences computed for 
each teacher separately. The same thing was then done for the 
judgments made by the Thorndike scale. These sums are 
tabulated in Table 42. 

TABLE 42

Showing the Sum of the Divergences, Negative and Positive, and the
Average Divergence of All Marks Given by Each Judge from the
Average of the Marks of All Judges on the Same Drawings

Thirteen drawings were rated five times by the percentage scale, considered
in succession as fourth grade, sixth grade, eighth grade, tenth grade and twelfth
grade drawings, and once by the Thorndike scale.

One very significant fact is revealed by this table. By the
percentage method of rating there is a marked tendency for a
teacher to be either much above, or much below the average on
practically all papers. This is indicated by the wide difference
between the plus and minus sums in the case of a great many
teachers. The meaning is very plain. The teachers have as yet
no uniform idea of how well a child in a certain grade should be
expected to draw. A great improvement is effected in this respect
by the use of the scale. As a measure of this improvement we
may compare the ratio of the sum of the differences between the
pairs in the plus and minus columns to the sum of the sums, for
the two methods of rating. For the percentage method the
sum of the two columns headed "Sum of judgments less than
average" and "Sum of judgments greater than average," is
34,119. The sum of the two corresponding columns for the
Thorndike scale is 724.15. If now we calculate the differences
between each pair constituting the two columns by each method,
and then add these differences, we find that the sum in the case
of the percentage method is 24,691, or 72 per cent as great as the
sum of the two columns. The sum in the case of the scale method
is 264.51 or 37 per cent of the sum of the two columns. There
can be no question, then, but that the use of the scale tends
decidedly to secure uniformity of standards of rating drawings.
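
The procedure behind Table 42 and the ratio just described can be sketched as follows. This is a minimal illustration and not the author's worksheet: the function names and the small set of marks are invented for the example, and only the four totals at the end (34,119 and 24,691 for the percentage method, 724.15 and 264.51 for the scale) are taken from the text.

```python
def divergence_sums(marks):
    """marks[j][p] is judge j's mark on paper p.

    Returns, for each judge, (sum of plus differences, sum of minus differences),
    each difference being measured from the average of all judges on that paper.
    """
    n_judges = len(marks)
    n_papers = len(marks[0])
    averages = [sum(marks[j][p] for j in range(n_judges)) / n_judges
                for p in range(n_papers)]
    sums = []
    for row in marks:
        diffs = [m - a for m, a in zip(row, averages)]
        plus = sum(d for d in diffs if d > 0)
        minus = sum(-d for d in diffs if d < 0)
        sums.append((plus, minus))
    return sums


def imbalance_ratio(sums):
    """Sum over judges of |plus - minus|, divided by the sum of all plus and minus sums."""
    net = sum(abs(p - m) for p, m in sums)
    total = sum(p + m for p, m in sums)
    return net / total


# Hypothetical marks: three judges, three papers.
marks = [
    [35, 62, 70],   # a judge who is above the average on every paper
    [29, 60, 63],   # a judge who agrees with the average
    [23, 58, 56],   # a judge who is below the average on every paper
]
example_ratio = imbalance_ratio(divergence_sums(marks))

# Totals reported in the text.
pct_ratio = 24691 / 34119        # percentage method
scale_ratio = 264.51 / 724.15    # Thorndike scale
print(round(example_ratio, 2), round(pct_ratio, 2), round(scale_ratio, 2))
```

A ratio near 1 means that each judge stands almost wholly on one side of the average; a ratio near 0 means that each judge's plus and minus differences nearly balance.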

I was interested to discover whether the variation of standards 
bore any relation to the type of experience which the various 
teachers had had who rated the drawings. Accordingly I asked 
the teachers to answer the question, "In what grades have you 
had teaching experience?" Opposite the numbers of the judges 
as they appear in Table 42, I tabulated their answers. I made 
only six classifications into which the experience would be placed, 
namely, (1) kindergarten or primary, or both; (2) intermediate 
grades; (3) upper grammar grades; (4) high school; (5) none; (6) 
not stated. This tabulation appears in the column at the ex- 
treme right of Table 42. By it we may see that no marked 
influence upon the standards is made by previous experience. 
There seems to be a slight tendency for primary teachers to 
demand more than do upper grade teachers, although this may 
be a mere chance indication for the few teachers here represented. 

Before closing the discussion of Table 42 I wish to make plain 
what may at first sight seem to be an error of calculation. It will 
be observed that the sums of the two columns of plus and minus 
differences are not equal, and that the average of the column of 
average divergences is not the same as the average of the average 
deviations given in Table 40. Two facts account for these 
seeming errors. The steps of the Thorndike scale are unequal, 
the larger differences being found in the main at the lower end 
of the scale. The averages from which the divergences were all 
computed were derived by considering the steps all equal, that 
is by the short method of guessed averages corrected by plus and 
minus divergences. The average deviations in Table 40 were 
also in terms of steps. When, however, we came to calculate
the difference between each teacher's mark and the average mark
for that paper, we subtracted the two figures, thus securing the 
difference not in terms of steps, but in terms of units. This 
makes the average found in Table 42 larger than that in Table 
40, because one is in terms of units, and the other in terms of 
steps. The fact that the lower ranges of the scale contain 
steps of a larger number of units, makes the minus difference 
column of Table 42, larger than the plus difference column. 

Returning now to the question of variability of judgments by 
the two methods of rating, we shall use Part III of Table 39 to 
compare with Table 40. This selection is made because it 
represents the median grade considered, and because the average 
deviations found are also midway between those of the grades 
below them and the grades above them. 

We note by the tables that the average of the average devia- 
tions by the percentage scale is 19.25 points on the scale, while 
for the Thorndike scale method, the average deviation is 1.29 
steps of the scale. Since the latter is calculated in steps, we shall 
have to think of the scale as consisting not of 17 units, but of 14 
steps. To compare the deviations at all, it is necessary to reduce 
the value of the step on the Thorndike scale to units on the per- 
centage scale. There is no absolutely correct way of doing this, 
but we may get an approximation which is near enough to justify 
our main conclusion. The range between the values of the 
poorest and best drawings on the one scale is about equal to the 
range between the poorest and best drawings on the other scale. 
If now we take the range between the average of the two poorest 
and the average of the two best in both cases, we shall come 
close enough to the relative size of steps on the two scales for our 
purposes. 

By this calculation we get the following: 

                    Lower Limit   Upper Limit   Difference

Percentage scale       21            78.5       57.5 points

Thorndike scale         5.12         15.77       9.18 steps

The derivation of the 9.18 as the difference on the Thorndike
scale is as follows: 5.12 is .32 of a step below 5.7, the value next
above it on the scale. Likewise, 15.77 is .86 of a step above
14.4, the value next below it. Between 5.7 and 14.4 there are 8
steps on the scale. Then between 5.12 and 15.77 there are 8
plus .32 plus .86 steps, or 9.18 steps.

The value of a step of the Thorndike scale in terms of the
percentage scale thus calculated is 6.26 points. We may now 
reduce the average deviations in the Thorndike scale to their 
equivalents in points of the percentage scale. Multiplying 1.29 
by 6.26 we get 8.08 points on the percentage scale representing 
the average deviation of the judgments by the Thorndike scale. 
Comparing this with 19.25, the average deviation by the percent- 
age method, we see that the variability by the Thorndike scale 
is only 42 per cent as great as by the percentage method. 
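
The chain of figures in this reduction can be retraced in a few lines. The sketch below is only a check on the arithmetic; the variable names are illustrative, and every number is one quoted in the text.

```python
pct_low, pct_high = 21.0, 78.5        # limits on the percentage scale
steps_between = 8 + 0.32 + 0.86       # Thorndike steps between 5.12 and 15.77

# Value of one Thorndike step in points of the percentage scale.
points_per_step = (pct_high - pct_low) / steps_between

# Average deviation by the scale (1.29 steps), using the step value
# rounded to two places as in the text.
ad_scale_in_points = 1.29 * round(points_per_step, 2)

# Variability by the scale relative to the percentage method (A.D. 19.25).
relative_variability = ad_scale_in_points / 19.25

print(round(points_per_step, 2),
      round(ad_scale_in_points, 2),
      round(relative_variability, 2))
```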

III. The Thorndike Handwriting Scale

A noteworthy experiment with the handwriting scales of both 
Ayres and Thorndike was conducted by Starch 1 at the Uni- 
versity of Wisconsin during 1913. He had fifteen specimens of 
children's writing rated by ten business men, and ten teachers 
in each of three ways: the percentage scale, the Ayres scale, and
the Thorndike scale. The order of papers was changed between 
each rating, and the order of methods of rating was changed 
from judge to judge. The average deviations were calculated 
for each group of judges, business men and teachers separately, 
upon each paper and the average of the 15 average deviations 
used as a basis of comparison of variability by the different 
methods. The instructions for the percentage scale ratings were
that 100 was to be considered perfect writing, and 0 to be con-
sidered writing with no merit. To translate steps of the Thorn-
dike scale into units of the percentage scale, 0 to 100 on the percent-
age scale was considered equal to 0 to 18 on the Thorndike scale.
With the Ayres scale the problem was not so simple, but Starch 
adopted the method of equating the range from the poorest mark 
to the best mark on the two scales, and the range from the next 
to the poorest to the next to the best, and so on, and using the 
average of all these equations as the value of the Ayres step in 
terms of the percentage scale. 
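
The method of equating ranges described here can be sketched as follows. This is one reading of the description and not Starch's own computation: the function name is an assumption, the marks are hypothetical, and the result is expressed as percentage points per unit of the other scale.

```python
def step_value_by_ranges(pct_marks, scale_marks):
    """Average ratio of percentage range to scale range.

    The ranges are taken from the poorest mark to the best, from the next
    to the poorest to the next to the best, and so on.
    """
    pct = sorted(pct_marks)
    scl = sorted(scale_marks)
    ratios = []
    for k in range(len(pct) // 2):
        pct_range = pct[-1 - k] - pct[k]
        scl_range = scl[-1 - k] - scl[k]
        if scl_range > 0:               # skip pairs with no spread on the scale
            ratios.append(pct_range / scl_range)
    return sum(ratios) / len(ratios)


# Hypothetical marks on six papers by each method (not data from the study).
pct_marks = [40, 50, 56, 64, 70, 90]
ayres_marks = [20, 30, 40, 50, 60, 70]

value = step_value_by_ranges(pct_marks, ayres_marks)
print(round(value, 3))
```

Averaging several such ratios, rather than relying on the extreme pair alone, keeps a single unusually high or low mark from fixing the conversion.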

As a result of this calculation he found the A.D.'s to compare
as follows:

                 Thorndike Scale   Ayres Scale   Percentage Scale

Business men          6.32            6.04           10.04

Teachers              5.66            5.49           10.39

1 Daniel Starch, The Measurement of Handwriting, The Journal of Edu-
cational Psychology, 4: 445.

From this we see that the variability by the Thorndike scale is,
on the average for the two groups of judges, but 58.6 per cent as 
great as with the percentage scale. 
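
The 58.6 per cent follows from averaging the two groups of judges for each method. The sketch below repeats that arithmetic from the figures tabulated above; the names are illustrative, and the corresponding ratio for the Ayres scale is computed in the same way although it is not stated in the text.

```python
# Average deviations reported by Starch, in points of the percentage scale.
ad = {
    "business men": {"thorndike": 6.32, "ayres": 6.04, "percentage": 10.04},
    "teachers":     {"thorndike": 5.66, "ayres": 5.49, "percentage": 10.39},
}


def mean_over_groups(method):
    """Average of the two groups' A.D.'s for one method of rating."""
    return sum(group[method] for group in ad.values()) / len(ad)


thorndike_ratio = mean_over_groups("thorndike") / mean_over_groups("percentage")
ayres_ratio = mean_over_groups("ayres") / mean_over_groups("percentage")

print(f"Thorndike: {100 * thorndike_ratio:.1f} per cent of the percentage-scale A.D.")
print(f"Ayres:     {100 * ayres_ratio:.1f} per cent of the percentage-scale A.D.")
```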

These judges were all without practice in the use of the scale. 
Starch claims that with practice the variability can be reduced 
by nearly half. He does not give the basis of that opinion beyond 
the fact that his own judgments are only about half as variable 
from the average of all the judgments on the papers as the aver- 
age of the unpracticed group. Of course that fact proves nothing 
about the effect of practice, but rather indicates how much better 
some people can use the scale than others. It would seem per- 
fectly natural that practice should improve the efficiency of a 
judge in using the scale, but we have no proof as yet of the claim. 

If the above study reveals the true gain in reliability of marking 
by means of the scale, it will prove a wonderful aid in standardiza- 
tion. A question arises in my mind as to the propriety of the 
instructions concerning the use of the percentage scale. In 
actual practice we do not think of the percentage scale as the 
distance between zero merit and perfection. We always have a
standard of some sort in mind, and use the 100 points to indicate 
the attainment of that standard. For example, if the teachers 
are rating a group of penmanship papers they always know what 
class of pupils wrote them, and they rate the papers on the basis 
of what they consider a proper standard of requirement for that 
grade of pupils. According to whether that standard is uniform 
or not in the several teachers who may be called upon to rate the 
paper, the ratings will be uniform or variable. In other words, 
it is possible that the concept of "standard work for seventh 
grade children" may be a much more uniform thing among 
teachers than the concept "perfect work." If so, then to indi- 
cate how serviceable the scale is for removing variability of 
marking we should have to have the papers rated on this basis, 
letting the teachers know what grade is being rated. 

With this in mind I secured the ratings upon fifty papers, se- 
lected at random from fifth, sixth, seventh, and eighth grade 
papers, by sixteen teachers, using the regular system prevailing in 
the school, and then using the Thorndike scale. 1 The teachers
were asked to rate each sample on the basis of what they consid-
ered to be the proper grammar school (eighth grade) achievement.
The system of marking which prevails in that city is that of
letters, E, G, F, and P for excellent, good, fair, and poor, respect-
ively. In tabulating the returns, these letters were changed to
9, 8, 7, and 6, respectively. The tabulations for the sixteen
judges' marks by the letter method are found in Table 43, and
the tabulations by the Thorndike scale are found in Table 44.

1 These data were gathered in Providence, R. I., under the direction of R. L.
McLaughlin, principal of Rochambeau Avenue Grammar School. My thanks
are hereby expressed to him and his teachers.

TABLE 43

A Set of Fifty Samples of Children's Handwriting Rated by the Common
Letter Method, P, F, G, and E, by Teachers. These Letters Were
Changed into Figures, 6, 7, 8, and 9, Respectively

TABLE 44

Ratings by the Thorndike Scale Upon a Set of Children's Handwritings
by Teachers. Same Papers and Same Judges as in Table 43

                                          Judges
Paper    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16    Avg.  A. D.

    1    9    9    9    9   10    8    9    9   11   11   11    8   11    9    8    7    9.25    .97
    2    8    8    9    8   14    9    9    8   14    8    8   11   11    8    9    9    9.44   1.53
    3    9    7    9    7   11    9    8    9    9    9    9    9    9    8    8    8    8.56    .72
    4    9   13   11    9   11   11   11   13   13   13   11   13   14   11   11   11   10.56   1.20
    5   11    8   11    9   10   13   11    8   15   13   14    9   11    9   10   11   10.81   1.59
    6   11   11    9   10    9    9   12    9   11   12   12   12   12   11   10   12   10.75   1.06
    7    9    9   10    9    8    9    9    8    9   12   15    9   12    9    9   13    9.94   1.54
    8   12   11   11    9    9    9   12    8   13   13   14    9   12    9   11   13   10.94   1.58
    9   11   11   11    9   10   12   13    8   12   11   14   11   12   11   11   11   11.13    .92
   10    9   13   11   10   11   13   15    8   16   11   12   12   13    9   10   12   11.50   1.56
   11   11   14   11   10    9   14   11   14   13   14    9   14   13    9   10   11   11.68   1.77
   12    9   13   10    9   11   11   11    9   13   11   13   11   11   11   11   11   10.94    .84
   13    8   11   11   10    8   11   13   12   11   13    9   13   11   13   11   13   11.13   1.28
   14    8   13   11   10    9   13   12    8   11   11   12   12   12   11   11   13   11.06   1.19
   15    9   12   12   10    8    9   12    9   12   12   12   12   15   12   10   12   11.13   1.53
   16   12   11   14   10   13   13   13   11   14   14   14   14   14    9   11   14   12.56   1.42
   17   11    9   14   11   13   14   14    8   13   16   11    9   13   12   13   14   12.19   1.79
   18   11   11   13   11   14   13   15    8   11   11   12   12   13   13   11   11   11.87   1.25
   19   11   13   12   10   10   13   11   13   11   13   13   11   13    9   11   12   11.63   1.12
   20   12   12   12   10    9    9   11    8    9   13   14    9   12   11   11   11   10.81   1.36
   21   12   12   13   11    9    9   12   14   12   13   12   12   12   11   11   13   11.75    .97
   22   15   15   15   13   11   13   13   13   13   17   13   13   13   14   13   13   13.56   1.02
   23   11   13   11   11   11   11   13    8   11   11   16   13   11   13   11    9   11.50   1.31
   24   14   15   17   11   11   11   13    8   13   13   14   13   13   13   13   13   12.81   1.28
   25   14   11   16   11   12   13   11   11   12   13   13   13   12   13   13   14   12.63   1.05
   26   13   15   15   10    9    9   12    8    9   16   16   12   12   12   13   13   12.13   2.02
   27   13   12   15   11    9   12   15    9   12   12   13   12   12   12   13   15   12.44   1.23
   28   15   14   16   11    9   13   15   11   12   12   12   13   13   13   13   13   12.81   1.23
   29   13   11   16   12   13   13   13   11   13   13   13   13   11   16   14   11   12.87   1.05
   30   17   16   18   13   15   13   13   13   15   14   13   14   13   13   14   15   14.31   1.30
   31   12   13   14   10   14    9   12    8   13   11   16   11   12   11   12   11   11.81   1.46
   32   15   11   17   13   15   14   15    9   16   14   11   13   13   14   16   14   13.75   1.56
   33   17   16   17   13   12   14   15    9   13   13   13   13   13   13   13   15   13.69   1.48
   34   16   13   16   13   13   14   16   11   16   14   14   14   14   14   14   14   14.13    .91
   35   14   12   18   14   12   12   12   12   16   15   16   12   15   16   13   13   13.87   1.63
   36   16   16   16   15   13   15   12   14   13   16   12   15   16   13   12   15   14.31   1.39
   37   15   14   16   13   15   14   15   11   11   14   11   1?   16   13   14   14   13.87   1.29
   38   15   11   16   13   15   14   15   11   11   14   16   16   16   13   13   13   13.87   1.52
   39   15   14   16   13   15   13   10   11   13   14   15   14   13   16   13   14   14.06   1.08
   40   16   16   17   14   14   12   15   13   16   13   17   15   18   13   15   15   14.94   1.33
   41   16   14   16   13   15   13   14   13   13   13   16   14   14   16   13   13   14.13   1.05
   42   15   13   16   13   14   12   14   13   15   12   17   13   13   14   13   13   13.75   1.09
   43   11   11   12   15   15   14    9   11   11   13   13   11    9   11   11   11   11.75   1.44
   44   12   14   12   13   13   13   13    9   13   13   16   15   15   13   15   13   13.25   1.09
   45   16   16   16   15   15   13   15   12   15   15   16   15   13   15   15   15   14.81    .80
   46   14   12   15   14   11   11   15   13   17   17   17   11   15   12   15   15   14.00   1.75
   47   10    8    9    9   11    9    9    8    9    8    9   11    8    9    9    8    9.00    .62
   48   10    8    9    9    9    9    9    9    9    9    8    9   11    8    9    9    9.00    .47
   49   11   14   12   12   14   14   13    9   11   11    9   12   13   11   11   11   11.75   1.25
   50   11    9   11   12   10   12   12    8   13   12   12   12   11    9    9   12   10.94   1.21

 Avg. 11.8 12.3 13.0 10.9 11.3 12.5 12.6  9.3 12.7 12.8 13.0 12.4 12.7 11.9 11.5 12.7   12.11   1.22

A comparison of the variabilities there shown will now be made.
The task of translating the steps of the Thorndike scale into 
steps of the letter scale is difficult on account of the narrowness 
of range of the letter scale. However, the method used for 
making the change in the drawing scales will be employed, except
that the average of the lowest five and the average of the highest
five will replace the averages of the lowest and highest two
respectively. The low and high extremes by both scales are
found by this method to be as follows:

                   Low Extreme   High Extreme   Difference

Thorndike scale       9.05          14.50          5.45

Letter method         6.374          8.064         1.69

Equating these differences or ranges, we have one step of the
letter scale equal to 3.22 steps of the Thorndike scale. By
Table 43 the average of the average deviations by the letter
scale is seen to be .3899. Converting this into steps of the
Thorndike scale by multiplying by 3.22, we get 1.255. This is a
trifle larger than the average of the variations found by the
Thorndike scale, which is 1.222. The situation is reversed,
however, if we make the correction of the A. D.'s for coarse
grouping. The distributions spread over about six steps in the
Thorndike scale, and over about three steps in the letter scale.
It seems as nearly correct as we can estimate to subtract .04
steps from the A. D. of the letter scale, and .02 from the A. D. of
the Thorndike scale. 1 That will leave the A. D. for the letter
scale .3499 steps, or 1.117 in terms of units of the Thorndike
scale. In like manner the correction will leave the A. D. for the
Thorndike scale 1.202. It seems, then, that as far as this experi-
ment goes, the Thorndike scale does not effect a reduction of
variability among judges when the customary standard of the
school is used instead of the unfamiliar standard of "zero merit"
to "perfect writing." Of course, we must not forget that this
lower variability is accomplished with long practice in using the
standard of "Grammar Grade Achievement." What practice
will accomplish with the standard scale is yet to be discovered.
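
The conversion used in this comparison can be retraced step by step. The sketch below is an illustration only, with variable names chosen for the example; all inputs are figures quoted above. Carried out this way, the corrected letter-scale deviation comes to about 1.13 steps of the Thorndike scale, slightly above the 1.117 printed; the direction of the comparison is the same in either case.

```python
thorndike_range = 14.50 - 9.05    # high extreme minus low extreme, Thorndike scale
letter_range = 8.064 - 6.374      # high extreme minus low extreme, letter method

# Thorndike steps per step of the letter scale, rounded as in the text.
factor = round(thorndike_range / letter_range, 2)

ad_letter = 0.3899                # average A.D. by the letter scale (Table 43)
ad_thorndike = 1.222              # average A.D. by the Thorndike scale (Table 44)

ad_letter_converted = ad_letter * factor

# Corrections for coarse grouping: .04 for the letter scale, .02 for the Thorndike scale.
ad_letter_corrected = (ad_letter - 0.04) * factor
ad_thorndike_corrected = ad_thorndike - 0.02

print(factor,
      round(ad_letter_converted, 3),
      round(ad_letter_corrected, 3),
      round(ad_thorndike_corrected, 3))
```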

1 E. L. Thorndike, Mental and Social Measurements, page 55.

Variability among the several judgments upon the same paper
is not the only phase of variability which we wish to avoid by
standardization. Some teachers grade all papers high, while
other teachers grade all papers low by the common marking
system. It is important to inquire whether that sort of varia-
bility is reduced by the use of the scale. To answer this inquiry
I averaged the marks of each teacher upon the 50 papers by both
methods of rating. These averages are found in the tables, but
are reproduced below for purposes of comparison:



           Avg. of Letter               Avg. of Scale
Judge        Judgments        Judge       Judgments

  13           6.72              8           9.3
   8           6.84              4          10.9
   2           7.08              5          11.3
   7           7.28             15          11.5
   4           7.28              1          11.8
   5           7.30             14          11.9
  12           7.30              2          12.3
  14           7.30             12          12.4
  15           7.32              6          12.5
   3           7.34              7          12.6
   6           7.42              9          12.7
   1           7.44             13          12.7
  10           7.52             16          12.7
  11           7.54             10          12.8
  16           7.54             11          13.0
   9           7.64              3          13.0

Average        7.30                         12.1
A. D.           .167                          .725



From these lists of averages we compute the average deviations 
from the averages in each column and find that the A. D. for the 
letter scale column is .167 while the A. D. for the scale column 
is .725. Converting .167 into units of the Thorndike scale by 
multiplying it by 3.22, we get .538 as the average deviation from 
the average among the sixteen teachers' average judgments of 
the fifty papers by the letter scale, as compared with .725 for the 
deviation among the judgments by the Thorndike scale. From 
this it seems that the handwriting scale is not entirely successful 
in leveling the varying standards among judges. Some judges 
rate all papers high by the scale and some others rate all papers 
low by the scale, on an average, more than they do by the letter 
method. In all probability, this would not be true with a group 
of teachers who had not had long experience in the grades to 
establish by practice a fairly uniform and fairly definite standard 
to assign letters by. 
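
The two average deviations can be recomputed directly from the lists of judges' averages reproduced above. The sketch below is a plain recomputation with illustrative function names; carried to more places it gives about .166 and .728, which agree with the published .167 and .725 to within rounding.

```python
def mean(xs):
    return sum(xs) / len(xs)


def average_deviation(xs):
    """Mean absolute deviation of the values from their own mean."""
    m = mean(xs)
    return mean([abs(x - m) for x in xs])


# Each judge's average over the fifty papers, as listed above.
letter_avgs = [6.72, 6.84, 7.08, 7.28, 7.28, 7.30, 7.30, 7.30,
               7.32, 7.34, 7.42, 7.44, 7.52, 7.54, 7.54, 7.64]
scale_avgs = [9.3, 10.9, 11.3, 11.5, 11.8, 11.9, 12.3, 12.4,
              12.5, 12.6, 12.7, 12.7, 12.7, 12.8, 13.0, 13.0]

ad_letter = average_deviation(letter_avgs)
ad_scale = average_deviation(scale_avgs)

# The published letter-scale A.D. expressed in steps of the Thorndike scale.
ad_letter_in_scale_steps = 0.167 * 3.22

print(round(ad_letter, 3), round(ad_scale, 3), round(ad_letter_in_scale_steps, 3))
```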

To determine whether this variation among judges by the
scale was usual or exceptional, I secured the ratings made by
six graduate students on each of two sets of papers, thirty-one
seventh grade papers, and thirty-one graduate students' papers, 1
using the Thorndike scale. The distributions of these ratings
are given in Table 45. The average of each judge's ratings on
the thirty-one papers is given at the foot of the columns. The
average deviation from the average among the six averages for
seventh grade papers is .883, while for the other set of papers it
is .633. If the average of these two marks be taken as typical
of the rating of these six graduate students, all of whom have a
vital interest in the problem of standardization, we find that it
is larger than the deviation among the average judgments recorded
for the teachers in Table 44. It seems then that we may expect
from unpracticed judges about .75 of a step average deviation
of their average judgment of a set of papers from that of any
competent group of judges, and from 1 to 1.25 of a step average
deviation among a group of judges on the same paper.

1 These data were secured by Messrs. W. T. Bawden and J. Riley, to whom
my thanks are hereby expressed for permission to use them.



TABLE 45 

The Judgments Upon Two Sets of Samples of Handwriting by Each 
of Six Judges, Graduate Students in Teachers College, Using 
the Thorndike Scale for Handwriting 



Samples of Seventh Grade Writing

Paper   J.1   J.2   J.3   J.4   J.5   J.6    Avg.  A. D.

    1    15    15    15    14    16    17   15.33    .78
    2    15    14    14    13    15    16   14.60    .83
    3    10    12     9    11    12     9   10.50   1.17
    4    14     9    11    12    14    12   12.00   1.33
    5    14    11    12    11    12    13   12.17    .89
    6    12    13    14    12    14    15   13.33   1.00
    7    15    15    15    14    16    17   15.33    .78
    8    13    13    13    13    14    16   13.67    .89
    9    15    15    16    13    12    16   14.50   1.33
   10    15    15    13    14    11    16   14.00   1.33
   11    15    15    11    12    11    15   13.17   1.83
   12    14    13    11    13    11    14   12.67   1.11
   13    15    15    14    14    13    15   14.33    .67
   14    11    11    13    12    12    13   12.00    .67
   15    12    15     8    11    14    15   12.50   2.17
   16    13     9    12    11    13    11   11.33   1.00
   17    16    11    15    12    14    13   13.50   1.50
   18    13    11    13    12    12    13   12.33    .67
   19    15    12    16    13    17    15   14.67   1.45
   20    14    15    12    14    16    15   14.33   1.00
   21    11    13     9    10    10    14   11.17   1.55
   22    15     9    12    12    11    16   12.50   2.00
   23    10     9    10     9    10     9    9.50    .50
   24    11    10     9    10    12    11   10.50    .83
   25    15    10    14    13    13    16   13.50   1.50
   26    15     9    15    14    11    16   13.33   2.22
   27    12    11    14    13    10    15   12.50   1.50
   28    15     8     8    11     9    12   10.50   2.17
   29    12    13     9    13    13    14   12.33   1.22
   30    15    14    13    12    13    16   13.83   1.17
   31    13    15    14    13    10    15   13.33   1.33

 Avg.  14.9  12.7  12.9  12.6  12.3  14.9           1.17

Samples of Graduate Students' Writing

Paper   J.1   J.2   J.3   J.4   J.5   J.6    Avg.  A. D.

    1    15    12    12    13    13    17   13.67   1.55
    2    12    13    11    12    12     9   11.50   1.00
    3    14    11    11    11    11    12   11.67    .89
    4    14    13    12    13    14    11   12.83    .89
    5     8     9     8     9     9     8    8.50    .50
    6    14    15    13    14    16    13   14.17    .89
    7    12    14    12    15    16    13   13.67   1.33
    8    14    13    12    15    14    15   13.83    .89
    9    14    12    11    12    11    13   12.17    .89
   10    12    11    11    12    10    12   11.33    .67
   11    13    12    11    12    12    12   12.00    .33
   12    13    11    12    12    11    14   12.17    .89
   13    14    11    11    13    12    11   12.00   1.00
   14    10     9    13    11    13     9   10.83   1.50
   15    12    13    10    11    14     9   11.50   1.05
   16    15    15    14    16    16    16   15.33    .67
   17     9     9     8     8     9     8    8.50    .50
   18    16    14    12    11    13    16   13.67   1.67
   19    15    13     8    11    10    14   11.83   2.17
   20    11    11    10    11    11    12   11.00    .33
   21    12    14    11     9     9    13   11.33   1.67
   22    15    13    12    11    11    12   12.33   1.11
   23    14    12    13    11    14    13   12.83    .89
   24    11    12    11    13    15    13   12.50   1.17
   25    14    10    12    14    16    15   13.50   1.67
   26    13     9    10    11    13    13   11.33   1.67
   27    15    12    13    15    15    14   14.00   1.00
   28    12    12    12    12    13    13   12.33    .44
   29    15     9    11    11    11    14   11.83   1.78
   30    12    13    13    14    15    15   13.66   1.00
   31    13    11    14    12    14    15   13.33   1.17

 Avg.  13.4  12.0  11.?  11.8  12.9  13.0           1.00



There is one other significant inquiry relating to variability 
among judgments made by the scale. How does the amount 
of variation between two successive judgments by the same
judge compare with the variation between different judges on
the same papers? For a tentative answer to this question I 
submit the data 1 which give the two successive judgments 
of four competent judges upon each of twenty-two speci- 
mens of handwriting, the judgments having been made several 
days apart in each case. These data are given in Table 46. To 
make the desired comparison it was necessary to calculate the 
difference between the judgment of each judge and that of each 
other judge on the same paper, and average those differences 
and then calculate the difference between the two successive 
judgments of each judge, and average those differences. The 
comparison of these two averages makes a fair answer to the 
question asked above. 
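
The two averages described here can be computed directly from the ratings of Table 46. The sketch below is a minimal illustration of that procedure; the function names are assumptions made for the example, and the ratings are those recorded in Table 46.

```python
from itertools import combinations

# One row per paper; each pair is (first, second) judgment for Judges I to IV.
ratings = [
    ((9, 9), (10, 11), (11, 10), (9, 8)),
    ((14, 14), (14, 14), (13, 13), (13, 13)),
    ((11, 10), (13, 9), (12, 12), (10, 10)),
    ((14, 14), (13, 14), (14, 13), (12, 13)),
    ((9, 9), (14, 13), (10, 9), (12, 11)),
    ((13, 11), (13, 12), (11, 12), (11, 12)),
    ((11, 10), (10, 11), (9, 10), (11, 14)),
    ((11, 10), (12, 11), (12, 11), (9, 11)),
    ((13, 12), (11, 10), (11, 13), (10, 11)),
    ((9, 9), (10, 11), (9, 9), (8, 10)),
    ((9, 9), (9, 11), (10, 11), (8, 10)),
    ((11, 10), (11, 10), (12, 10), (9, 10)),
    ((11, 11), (10, 9), (10, 10), (9, 9)),
    ((10, 9), (10, 10), (8, 9), (8, 9)),
    ((9, 9), (9, 9), (10, 11), (9, 10)),
    ((11, 10), (11, 14), (14, 13), (11, 11)),
    ((13, 13), (12, 14), (14, 14), (12, 13)),
    ((11, 11), (11, 11), (11, 11), (11, 11)),
    ((10, 10), (11, 11), (11, 10), (8, 10)),
    ((10, 10), (8, 11), (11, 10), (8, 9)),
    ((12, 12), (11, 11), (13, 12), (9, 11)),
    ((12, 10), (12, 12), (14, 14), (10, 12)),
]


def between_judge_mean(ratings):
    """Average difference between each judge and each other judge on the same
    paper, taken within each series (six pairs in each of two series)."""
    diffs = []
    for paper in ratings:
        for series in (0, 1):
            marks = [judge[series] for judge in paper]
            diffs.extend(abs(a - b) for a, b in combinations(marks, 2))
    return sum(diffs) / len(diffs)


def within_judge_mean(ratings):
    """Average difference between the two successive judgments of one judge."""
    diffs = [abs(first - second) for paper in ratings for first, second in paper]
    return sum(diffs) / len(diffs)


between = between_judge_mean(ratings)
within = within_judge_mean(ratings)
print(round(between, 2), round(within, 3))
```

On the ratings as recorded in Table 46 this gives about 1.22 steps between judges and about .88 of a step within the same judge, so that a judge differs less from himself than from the other judges.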

TABLE 46 

Ratings of Four Graduate Students of Teachers College Upon Each
of Twenty-Two Specimens of Handwriting by Means of the Thorn-
dike Scale. A Second Series of Ratings Made by the Same Judges
Several Days Later Are Recorded for Purposes of Comparison

            Judge I       Judge II      Judge III     Judge IV
Papers     1st  2nd      1st  2nd      1st  2nd      1st  2nd

    1        9    9       10   11       11   10        9    8
    2       14   14       14   14       13   13       13   13
    3       11   10       13    9       12   12       10   10
    4       14   14       13   14       14   13       12   13
    5        9    9       14   13       10    9       12   11
    6       13   11       13   12       11   12       11   12
    7       11   10       10   11        9   10       11   14
    8       11   10       12   11       12   11        9   11
    9       13   12       11   10       11   13       10   11
   10        9    9       10   11        9    9        8   10
   11        9    9        9   11       10   11        8   10
   12       11   10       11   10       12   10        9   10
   13       11   11       10    9       10   10        9    9
   14       10    9       10   10        8    9        8    9
   15        9    9        9    9       10   11        9   10
   16       11   10       11   14       14   13       11   11
   17       13   13       12   14       14   14       12   13
   18       11   11       11   11       11   11       11   11
   19       10   10       11   11       11   10        8   10
   20       10   10        8   11       11   10        8    9
   21       12   12       11   11       13   12        9   11
   22       12   10       12   12       14   14       10   12



The differences thus determined are tabulated for each of the 
twenty-two papers in Table 47. In the column next to the 
numbers of the papers, for example, are given the differences 
between the ratings of judges I and II on their first judgments 
of the set, and in the first column under "Later Series" are 
given the differences between the ratings of the same two judges 
on their second judgments of the set. All of these twelve differ- 
ences are averaged for the column "Avg. of 12 differences." 
In the next column to the right of this column of averages are 
given the extreme differences between the lowest and highest 
mark given to each paper in all of the eight judgments. Finally 
on the right of this column are given the differences between the 
two judgments of each judge. 

1 These data were secured by R. O. Runnells, to whom my thanks are hereby 
expressed. 

TABLE 47 

Differences Among the Ratings Recorded in Table 46, in Terms of 
Steps on the Thorndike Scale. Each Judge Compared with Each 
Other Judge and Each Judge Compared with Himself 

By Table 47, we note that the average difference between one 
judge and another is 1.22 steps of the scale, while the average 
difference between the two judgments of the same judge is .864 
steps of the scale. A judge is, then, less variable with himself 
than with other competent judges. A reasonable interpretation 
of this fact would seem to be that one factor in the production 
of the variability in rating among competent judges is the vary- 
ing standards of merit previously established in the judges' 
minds. Otherwise, if the variability were produced by chance 
inaccuracies in comparing the specimen with the scale, there 
should be found as great differences between successive judgments 
of the same judge, as between the judgments of different judges. 
Since this is found not to be the case, it is pretty good indication 
that when practice with the scale has established in the minds 
of judges the same standard of merit which the scale represents, 
the variability among judges will decrease markedly. Certainly 
for those teachers who begin their teaching with the scale stand- 
ard as their guide, familiarity with any other standard will be 
unlikely to enter as a factor to produce variability. There is, 
therefore, strong probability that the variability found among 
unpracticed judges is much greater than will be found among 
those who make regular use of the scale, while the variability by 
the letter or per cent scales with which we have made comparison 
in this experiment would be found greater if we had teachers 
with less experience. In other words, the teachers have reduced 
the variability shown by the per cent method, by practice at the 
expense of children, while they have at the same time decreased 
their capacity for effective use of a standard scale. 

It may be worth while to cite as further evidence on this point, 
the reduced amounts of difference shown in the second series of 
judgments over those of the first series. These differences are 
averaged at the foot of Table 47, and from the footings we may 
determine that the average difference for the first series is 1.35 
steps, while the average difference for the second series is 1.09 
steps. If this is a fair indication of the effect of practice, we may 
expect easily to overcome the major part of the variability found 
in these experiments by a practical use of the scale. Evidence 
pointing in the same direction may be obtained from Table 44 
where the average of the A. D.'s for the first twenty-five papers 
rated by the judges is found to be 1.274, while the average of the 
A. D.'s for the last twenty-five papers is found to be 1.17. 

IV. The Hillegas Composition Scale 

There is less agreement among teachers as to what constitutes 
merit in composition than in any other subject, probably. This 
is seen in comparison with drawing by the fact that there are only 
10 steps between zero merit and a practically perfect composition 
by the Hillegas scale, while there are 17 steps between zero merit and 
a practically perfect drawing by the Thorndike scale. In both 
these scales, a unit of difference is just that amount which is 
recognized by 75 per cent of the judges, and, therefore, the 
agreement as to what constitutes merit in drawing is far more 
general than in composition. This fact makes the production 
of a scale for general merit a most difficult thing, and also makes 
necessary the expectation of a high variability among unprac- 
ticed judges in rating compositions by the scale. 

In order to indicate just how variable are the standards among 
teachers at present as to what constitutes merit in compositions, 
and as to how much merit should be expected from children of 
different grades, ratings by three teachers were secured upon a 
set of thirty-one compositions, ratings of four other teachers 
were secured upon a set of forty-two compositions, and ratings of 
four other teachers upon another set of thirty-seven composi- 
tions. 1 The teachers first rated the papers by the common letter 
method which is in use in the schools of that city, A, B, C, and 
D being used to designate the four steps from best to poorest. 
These data are given in Table 48 for the three groups of judges. 

TABLE 48 

Giving the Common Letter Rating Upon Three Sets of Compositions 
by Groups of Teachers Called Judges 

                         Mark A   Mark B   Mark C   Mark D   Total

First Set of Papers 
  Judge I                   4        6        7       14       31
  Judge II                  5        7        5       14       31
  Judge III                 5       10        6       10       31

Second Set of Papers 
  Judge I                   2       12       15       13       42
  Judge II                  7       11        ?        ?
  Judge III                10       10       14        8       42
  Judge IV                 14        6       11       11       42

Third Set of Papers 
  Judge I                  19        5        8        5       37
  Judge II                  9       14        9        5       37
  Judge III                10        8        9       10       37
  Judge IV                 14        7        3       13       37

1 These ratings were secured by W. H. Smith at East Orange, N. J., to 
whom my thanks are hereby expressed. 

Table 48 calls for little comment. It is very evident that the 
teachers constituting each of the groups have no common idea 
of what is an A composition, or a B, C, or D. To discover some 
means of defining those marks is surely one of the greatest needs 
of educational administration. Afterwards the teachers were 
asked to arrange the set of compositions in order of merit from 
best to poorest. From this arrangement a rank number was 
given to each composition according to the position given it by 
each teacher. Needless to say, these positions were assigned 
by each teacher without the knowledge of the position assigned 
by previous teachers. The rank positions are listed in Table 49 
for each set of papers as ranked by each teacher, called judge. 

The only way to get an adequate notion of the difference be- 
tween the positions of the papers as ranked by one judge and as 
ranked by another is to examine the tables. There is a little 
danger, however, of one's being misled by the fact that since the 
three sets contain unequal numbers of papers, a given difference 
in rank does not mean the same for the three sets. On this 
account, and also to make possible definite comparison with 
other variations in rating, I calculated coefficients of correlation 
between the relative positions assigned by each judge with the 
relative positions assigned the same papers by each other judge. 
Thus for set 1, the relationships between the positions given by 
judge I and judge II, those given by judge I and judge III, those 
given by judge II and judge III, were expressed by these coeffi- 
cients of correlation. The same practice was followed for sets 
2 and 3, there being six coefficients found for each set. 

The method used for computing these coefficients was that of 
differences in relative positions or ranks, using the formula, 

$r = \sin\left(\frac{\pi}{2} R\right)$, where $R = 1 - \frac{6 \sum g}{n^2 - 1}$, 

$\sum g$ being the sum of the plus differences in rank, and n being 
the number of cases, and determining the r value by the use of 
Table 37 on page 169 of Thorndike's "Mental and Social 
Measurements" (ed. 1913). 
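
As a rough illustration of the method of rank differences, the sketch below computes R from the positive differences in rank and then converts it to r. It takes the conversion to be r = sin(πR/2); the author read r from Thorndike's Table 37 rather than evaluating a sine, so both the function name `footrule_r` and the closed-form conversion are assumptions, and the two rankings are invented.

```python
import math


def footrule_r(ranks_a, ranks_b):
    """Return (R, r) for two rankings of the same papers.

    R = 1 - 6 * (sum of the plus differences in rank) / (n**2 - 1).
    The conversion r = sin(pi/2 * R) is an assumed closed form standing
    in for the table lookup described in the text.
    """
    n = len(ranks_a)
    plus_differences = sum(max(b - a, 0) for a, b in zip(ranks_a, ranks_b))
    R = 1 - 6 * plus_differences / (n * n - 1)
    return R, math.sin(math.pi / 2 * R)


# Invented rankings of five papers by two hypothetical judges.
judge_one = [1, 2, 3, 4, 5]
judge_two = [2, 1, 3, 5, 4]
R, r = footrule_r(judge_one, judge_two)
print(f"R = {R:.2f}, r = {r:.2f}")
```

Because only the plus differences are summed, R does not reach -1 even for completely reversed rankings.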

TABLE 49 

Giving the Rank Positions of Compositions as Judged by Different 
Teachers: Thirty-One Papers in Group One, Rated by Three 
Teachers; Forty-Two Papers in Group Two, Rated by Four Teach- 
ers; Thirty-Seven Papers in Group Three Rated by Four Teach- 
ers. The Paper Considered BEST is Given Rank 1 

These coefficients are given below: 

Between ranks              Coefficients of 
given by                   Correlation 

Set 1 
  Judges I and II               .70
  Judges I and III              .88
  Judges II and III             .78

Set 2 
  Judges I and II               .79
  Judges I and III              .75
  Judges I and IV               .65
  Judges II and III             .77
  Judges II and IV              .62
  Judges III and IV             .78

Set 3 
  Judges I and II               .53
  Judges I and III              .62
  Judges I and IV               .72
  Judges II and III             .62
  Judges II and IV              .66
  Judges III and IV             .62

In order to compare the coefficients derived by this method 
with those which would result from some other method, I used 
with set 2 the formula, 

$r = 2 \sin\left(\frac{\pi}{6} \rho\right)$, where $\rho = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$, 

where D = differences in rank and n = the number of cases. 

I also calculated the Pearson coefficient to show the relation 
between the positions assigned the papers of the first set by judges 
I and II. These coefficients are given below: 

Set 2 
  Judges I and II               .80
  Judges I and III              .76
  Judges I and IV               .70
  Judges II and III             .79
  Judges II and IV              .71
  Judges III and IV             .84

Set 1 
  Judges I and II               .655 (Pearson coefficient) 
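
The second method, based on squared differences in rank, can be sketched as follows. The rank coefficient ρ and the conversion r = 2 sin(πρ/6) are the standard forms of this method; the function names and the two short rankings are hypothetical and serve only to show the arithmetic.

```python
import math


def spearman_rho(ranks_a, ranks_b):
    """rho = 1 - 6 * sum(D**2) / (n * (n**2 - 1)), D being rank differences."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


def rho_to_r(rho):
    """Convert the rank coefficient to r by r = 2 * sin(pi/6 * rho)."""
    return 2 * math.sin(math.pi / 6 * rho)


# Invented rankings of five papers by two hypothetical judges.
judge_one = [1, 2, 3, 4, 5]
judge_two = [2, 1, 3, 5, 4]
rho = spearman_rho(judge_one, judge_two)
print(f"rho = {rho:.2f}, r = {rho_to_r(rho):.3f}")
```

For perfectly agreeing or perfectly reversed rankings the conversion leaves the coefficient at +1 or -1; between those limits it raises the absolute value slightly.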

The average of all the coefficients listed above is a little less 
than .72. From this we may get some notion of the extent of 
the variation in standards among teachers regarding merit in 
composition work. 

While we might expect considerable change of position among 
the compositions near the middle of the group, we are surprised 
to see, for example, the composition in set 3 which judge I con- 
siders the best, given rank 23 by judge IV. To determine the 
extent of agreement regarding the best compositions, I averaged 
the ranks given by the other judges of the group upon the five 
papers considered best by each judge. To illustrate, in set 1, 
judge I considered papers 12, 20, 3, 2, and 19 the five best. The 
average of his ranks on these five papers is, of course, 3. The 
average of the ranks assigned to the same five papers by the 
other two judges is 3.5. Carrying out a similar calculation for 
all the groups we have the following: 

         The Papers Ranked            Avg. Rank Among 
         1, 2, 3, 4, and 5 by         Other Judges 

Set 1    Judge I                          3.5
         Judge II                         5.4
         Judge III                        3.3
         Avg.                             4.07

Set 2    Judge I                         12.3
         Judge II                        10.6
         Judge III                        9.5
         Judge IV                         8.6
         Avg.                            10.25

Set 3    Judge I                         13.6
         Judge II                        11.0
         Judge III                        9.8
         Judge IV                        10.4
         Avg.                            11.2

From this it appears that there is an average change of rank 
among the judgments on the best five papers in each group as 
follows: 

For Set 1     4.07 minus 3, or 1.07 
For Set 2    10.25 minus 3, or 7.25 
For Set 3    11.2 minus 3, or 8.2 

The average change of rank for all the papers for each set was 
found to be as follows: 

For Set 1    4.28 
For Set 2    6.66 
For Set 3    6.94 

From this it appears that the average change of rank among 
the best papers is much less than among the set as a whole in the 
case of set 1, but in the two other sets, the changes are greater 
among the positions of the best papers than among the papers 
as a whole. This is significant because we are led by psycholo- 
gists to believe that in any normal group we should find at the 
extremes of the distribution a small number whose ability should 
be easily distinguishable from the ability of the majority of the 
group. So far as these sets of papers go, either this assumption 
is not true, or else the teachers are not able to recognize this 
ability, as shown by the compositions making up sets 2 and 3. 

This brief experiment should serve to emphasize two points: 
The need of standardization in composition work, and the great 
difficulty in the way of such standardization. We shall now 
proceed to our examination of the Hillegas scale. 

In the following study of the Hillegas scale for English com- 
position, it must not be forgotten that it is but one phase of its 
usefulness which is being investigated. Our entire thesis pertains 
to variability among standards of teachers as shown in the mark- 
ing of students upon daily work and examinations. In the study 
of the Hillegas scale we confine ourselves to its availability as an 
objective measure by which variability of rating may be reduced, 
and to its responsiveness in locating the varying amounts of 
merit among the several papers to be marked by it. Its great 
value as a means of defining merit is not examined beyond these 
two points. 

The amount of variability among the marks given by many 
judges to the same paper is always in terms of the steps used in 
the scale of marking, and to compare the variability of the marks 
given by two different methods, it is necessary to equate the 
steps of the two scales. This is undertaken with the first group 
of data which follows. 

During the summer of 1913 under the direction of Professor 
Strayer, to whom I am very much indebted for permission to 
use the data, a set of twenty-eight seventh grade compositions 
written by the pupils in the schools of Baltimore County, Mary- 
land, were first rated by about sixteen teachers of the county, 
using the common marking system, 0 to 100, with the step of 5 
as the unit. Thus the marks ran 60, 65, 70, 75, and so on. The 
same papers were then marked by about sixteen teachers from the 
same group, using the Hillegas scale. The ratings of each teacher 
were put upon the reverse side of the sheet containing the com- 
position, thus permitting each one to observe the ratings made 
by previous judges. The caution was given and without doubt 
was conscientiously heeded, that each judge should have definitely 
made up his mind what value he attached to the composition 
before turning over the paper, and that he should then put that 
value down regardless of how it differed from marks of previous 
judges. 

In the following table, No. 50, the distributions of marks given 
by the two methods of rating are given. 



TABLE 50 

Distribution of Ratings Upon Twenty-Eight Seventh Grade Compo- 
sitions Given by About Sixteen Teachers in Baltimore County, 
Maryland. The Common Percentage Method Was Used for the 
First Rating, and the Hillegas Scale for the Second 

In calculating the averages and average deviations recorded in 
the table some difficulties were encountered with the Hillegas 
scale distributions. The successive steps are not equal, but it 
was necessary to indicate the deviations in steps. The size of 
the successive intervals in terms of Median Deviations (one M. D. 
being that difference in merit which exists between two composi- 
tions when just 75 per cent of competent judges recognize the 
difference) is found by subtracting the value of each sample 
from the value of the one next above it, and dividing by 100. 
The successive steps of the scale are found to be, 

1.83 Median Deviations 
 .77    "        " 
1.09    "        " 
1.05    "        " 
1.11    "        " 
 .90    "        " 
 .97    "        " 
 .66    "        " 
 .97    "        " 




In determining the average of the series of judgments assigned 
to a given paper, no account was taken of the difference in the 
sizes of steps except in the case of the step where the average 
fell. The method of deviations from the guessed average was 
used, and the steps of deviation all counted equal. However, 
when the location within the step was determined for the true 
average, the actual difference or size of step was used. For ex- 
ample, in the case of paper 1, the deviations above the guessed 
average were found to be 6, while the deviations below were 3, 
thus making a difference of 3. There were seventeen judgments, 
therefore the true average was calculated as 3/17 of the difference 
between 5.85 and 6.75, above 5.85, or at 6.01. In calculating 
the A. D., however, the 3/17 was counted simply as a fraction 
of a step, and the total steps of deviation determined by adding 
to 3 plus 6, the product of 8 plus 3 minus 6, and 3/17. Thus 
9 15/17 divided by 17 gives the A. D. as .58 steps. 
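
The worked example for paper 1 can be checked with a short sketch. It assumes the seventeen judgments were distributed as 3 at 4.74, 8 at 5.85, and 6 at 6.75, which is consistent with the deviations quoted above but is not stated outright; the function name `average_and_ad` is likewise hypothetical.

```python
import math


def average_and_ad(counts, scale_values):
    """Average and A.D. of judgments recorded on a scale of unequal steps.

    counts[i] is the number of judgments at scale_values[i]. The mean is
    first found with all steps counted equal; its fractional position inside
    a step is then converted with that step's actual width. The A.D. is left
    in steps, as in the text.
    """
    n = sum(counts)
    mean_in_steps = sum(i * c for i, c in enumerate(counts)) / n
    lower = min(math.floor(mean_in_steps), len(counts) - 2)
    fraction = mean_in_steps - lower
    width = scale_values[lower + 1] - scale_values[lower]
    average = scale_values[lower] + fraction * width
    ad = sum(c * abs(i - mean_in_steps) for i, c in enumerate(counts)) / n
    return average, ad


# Assumed distribution for paper 1: 3 judgments below, 8 at, and 6 above 5.85.
average, ad = average_and_ad([3, 8, 6], [4.74, 5.85, 6.75])
print(f"average = {average:.2f}, A.D. = {ad:.2f} steps")
```

The sketch places the average 3/17 of the way from 5.85 to 6.75 and gives an A.D. of 9 15/17 divided by 17, matching the figures in the text.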

It cannot be claimed that this method of finding either the 
average or the average deviation is precisely correct. Since, 
however, the value of the study does not hinge upon the absolute 
accuracy of either measure, but upon the average of the 28 aver- 
ages and the average of the 28 average deviations, it is believed 
that an error in one direction with one, will be balanced by an 
error in the opposite direction with another, and so in the end 
substantially the same results will be obtained as if a much more 
elaborate method had been used. 

Turning now to the problem of equating a step of the Hillegas 
scale with steps of the other scale (called for want of a better 
designation, the percentage scale) several alternatives presented 
themselves. The whole number of places on the one could be 
called equal to the whole number of places on the other, thus mak- 
ing, 

1 step, Hillegas scale, equal 10 steps, percentage scale. 

This seemed unfair because the range on the percentage scale 
used by the teachers was near the top, while not a judgment by 
the Hillegas scale was at either 8.38 or 9.37. Similarly the custom 
of rarely grading papers below 50 shuts off the use of the lower 
half of the percentage scale. 

Another alternative was to take the range between the highest 
mark and the lowest mark given to any paper in the group by 
both methods of rating, and equate those ranges. This alterna- 
tive was not adopted because it seemed to give undue weight to 
extreme judgments. 

The method which seemed fairest was to equate the ranges 
between the average of the five lowest papers and the average 
of the five highest papers found by both scales. This is the 
method used in the following calculations. 

From Table 50 we find the five lowest papers as judged by 
the Hillegas scale are Nos. 21, 26, 12, 23 and 22. Their values 
by the two methods of rating constitute the first part of Table 
51. The derivation of the other portions of the table will be 
apparent upon inspection. 



TABLE 51 

Giving the Five Lowest and Five Highest Papers as Found by Each 
Method of Rating, and the Corresponding Values Given the 
Same Papers by the Other Method 

Lowest Papers by    Corresponding Values    Lowest Papers by    Corresponding Values
Hillegas Scale      by Percentage Scale     Percentage Scale    by Hillegas Scale

No. 21,   4.18            60.9              No. 21,   60.9            4.18
No. 26,   4.35            74.06             No. 22,   63.45           4.87
No. 12,   4.54            72.4              No. 9,    64.7            5.11
No. 23,   4.61            72.35             No. 8,    69.4            5.29
No. 22,   4.87            63.45             No. 13,   71.8            5.49

Averages  4.51            68.63                       66.05           4.99

Highest Papers by   Corresponding Values    Highest Papers by   Corresponding Values
Hillegas Scale      by Percentage Scale     Percentage Scale    by Hillegas Scale

No. 14,   7.4             92.9              No. 14,   92.9            7.4
No. 28,   6.96            92.35             No. 28,   92.35           6.96
No. 6,    6.75            92.1              No. 6,    92.1            6.75
No. 16,   6.6             83.55             No. 16,   83.55           6.6
No. 19,   6.48            80.0              No. 5,    86.1            6.41

Averages  6.84            88.18                       89.40           6.82

From the above table the following ranges appear: 

1st. Lowest five Hillegas scale to highest five same scale, 6.84 
less 4.51, or 2.33 steps. 

2nd. Corresponding values, percentage scale, 88.18 less 68.63, 
or 19.55. 

3rd. Lowest five to highest five, percentage scale, 89.40 less 
66.05, or 23.35 steps. 

4th. Corresponding values, Hillegas scale, 6.82 less 4.99, or 
1.83 steps. 

From the three methods of equating possible from the above 
ranges, we get the following: 

Equating the first and second, we have 1 step Hillegas scale 
equals 8.39 steps percentage scale. 

Equating the first and third, we have 1 step Hillegas scale 
equals 10.02 steps percentage scale. 

Equating the third and fourth, we have 1 step Hillegas scale 
equals 12.76 steps percentage scale. 

The average of these three values, 8.39, 10.02, and 12.76 is 
10.39, and this is taken to be a fair value of the step in the Hille- 
gas scale in units of the percentage scale. If this be admitted 
as fair, we may now proceed to compare the variabilities existing 
with the two methods of rating. 
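
The three ways of equating the ranges, and their average, amount to the following arithmetic. The variable names are hypothetical; the figures are the averages of Table 51.

```python
# Ranges between the five lowest and five highest papers (Table 51 averages).
hillegas_own = 6.84 - 4.51             # 1st: Hillegas range, Hillegas selection
percentage_matching = 88.18 - 68.63    # 2nd: percentage values of those papers
percentage_own = 89.40 - 66.05         # 3rd: percentage range, own selection
hillegas_matching = 6.82 - 4.99        # 4th: Hillegas values of those papers

# Percentage units per Hillegas step under each pairing of ranges.
ratios = [
    percentage_matching / hillegas_own,   # first and second
    percentage_own / hillegas_own,        # first and third
    percentage_own / hillegas_matching,   # third and fourth
]
step_in_percentage_units = sum(ratios) / len(ratios)

print([round(x, 2) for x in ratios], round(step_in_percentage_units, 2))
```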

The lists of average deviations for each paper with each method 
of rating are given in Table 50 under A. D. The averages of these 
two sets of A. D.'s are 5.08 steps for the percentage scale, and 
.722 steps for the Hillegas scale. Reducing the latter to its 
equivalent in units of the percentage scale by multiplying it by 
10.39, we have the variability of the judgments by the Hillegas 
scale represented by an average deviation of 7.50 units of the 
percentage scale. This, it will be observed, is very much larger 
than the average deviation of the judgments given the same 
papers by the percentage method. 

A proper correction for the coarseness of grouping in both 
distributions would operate to reduce the average deviation found 
for the Hillegas scale more than for the other one. The number 
of steps in each distribution of judgments by the Hillegas scale 
is, on the average, a little less than 4, while the number of steps 
on the percentage scale (each step being 5 units) is, on the aver- 
age, a little less than 6. From the table given on page 55 of 
Thorndike's "Mental and Social Measurements," it seems fair 
to subtract .04 of a step from the A. D. found for the Hillegas 
scale distributions, and .02 of a step from the A. D. found for the 
percentage scale distributions. This makes the corrected A. D.'s 
.682 steps on the Hillegas scale, and 4.98 units on the percentage 
scale. (The correction, which is .02 steps, must be multiplied by 5, 
the number of units in the step, and this product subtracted from 
5.08, leaving 4.98.) Reducing the corrected A. D. for the Hille- 
gas scale to its equivalent in percentage units, we have 7.08. 
This still seems surprisingly large in comparison with 4.98, the 
average deviation for the percentage ratings. 
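
The conversion of the Hillegas A. D. into percentage units, before and after the correction for coarse grouping, can be retraced as follows. The variable names are hypothetical; the figures are those quoted in the text.

```python
STEP_IN_PERCENTAGE_UNITS = 10.39   # value of one Hillegas step, found above
UNITS_PER_PERCENTAGE_STEP = 5      # marks ran 60, 65, 70, ...

ad_hillegas_steps = 0.722          # average A. D., Hillegas scale (steps)
ad_percentage_units = 5.08         # average A. D., percentage scale (units)

# Uncorrected comparison, both figures expressed in percentage units.
hillegas_in_units = ad_hillegas_steps * STEP_IN_PERCENTAGE_UNITS

# Corrections for coarseness of grouping: .04 step and .02 step respectively.
corrected_hillegas_steps = ad_hillegas_steps - 0.04
corrected_percentage_units = ad_percentage_units - 0.02 * UNITS_PER_PERCENTAGE_STEP
corrected_hillegas_in_units = corrected_hillegas_steps * STEP_IN_PERCENTAGE_UNITS

print(f"uncorrected: {hillegas_in_units:.2f} against {ad_percentage_units:.2f}")
print(f"corrected:   {corrected_hillegas_in_units:.3f} against "
      f"{corrected_percentage_units:.2f}")
```

The corrected product is about 7.086, which the text reports as 7.08; either way the Hillegas figure remains well above 4.98.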

The conclusion arrived at in the above study pointed to the 
necessity for further study of the workings of the Hillegas 
scale. To meet this necessity, the following data were gathered. 

The same twenty-eight compositions whose ratings were given 
in Table 50 were rated by a class of graduate students in Teachers 
College under the direction of Professor Strayer. The papers 
were passed to the students who had each a copy of the Hillegas 
scale. Each one graded the composition in his hands, placing 
the mark on the reverse side of the paper. On signal, the papers 
were passed along and graded again. The caution was again 
urged that each one should have made up his mind definitely 
what mark the paper deserved before turning it over, and that 
the mark should not be changed no matter how far it differed 
from the marks previously recorded. It seems fair to assume 
that students so interested in education would be able to follow 
this suggestion. 

The papers were passed until each had been marked by six- 
teen judges. (There were three people who left before the six- 
teenth judgment was made, thus leaving three papers with but 
fifteen judgments. These papers are Nos. 12, 25 and 27.) These 
sixteen successive series of judgments are recorded in Table 52, 
page 120, as well as the tables of frequency for each paper. 
The papers are in the same order as given in the previous table, 
No. 50, so that comparisons of ratings by the two sets of judges 
may be made. 

We note from this table that the variability is greater with 
this set of judges than it was with the Baltimore County teachers. 
It will be observed also that the average of the averages is greater 
by .18 of a step. It is interesting, furthermore, and rather signif- 
icant, that the averages of the successive series of judgments on 
the whole set of papers vary from a maximum of 6.17 in the 
fourth, fifth and fifteenth judgments, to a minimum of 5.48 in 
the second judgment. If such variation is typical among sup- 
posedly competent judges, it cannot be held that the scale is 
very satisfactory as an objective measure, in the hands of un- 
practiced judges. 

Further investigation of the scale was made by examining 
the ratings upon a set of twenty-eight fifth grade compositions 
which were marked by about sixteen teachers in Baltimore 
County, Maryland, and again by the same number of graduate 
students of Teachers College. The tables of frequency for both 
sets of judges are given in Table 53. 



TABLE 53 

Distribution of Two Sets of Judgments by the Hillegas Scale Given to 
Twenty-Eight Fifth Grade Compositions by About Sixteen Baltimore 
County Teachers, and Later by About Sixteen Graduate Students of 
Teachers College 

Scale values heading the frequency columns: 1.83, 2.60, 3.69, 4.74, 
5.85, 6.75, 7.72. The frequencies for each paper are listed in 
ascending order of scale value; empty columns are not shown. 

           Baltimore County Teachers         Graduate Students
Papers     Frequencies      Avg.    A.D.     Frequencies      Avg.    A.D.

   1       1 4 5 4 3        4.94    1.03     3 3 5 4 1        5.64     .98
   2       7 5 4            5.64     .71     8 7 1            5.36     .56
   3       1 7 6 1          4.19     .63     6 7 3            4.54     .61
   4       3 8 4            3.76     .50     1 7 6            4.06     .55
   5       5 8 2            4.53     .52     1 2 13           4.48     .41
   6       7 7 1            5.41     .56     3 6 6 2          5.20     .79
   7       5 6 3 1          3.69     .62     1 7 6 1 1        4.35     .75
   8       1 9 3 2 1        5.30     .83     7 7 2            5.51     .60
   9       4 9 2 1          4.74     .50     1 3 7 3 2        4.88     .79
  10       1 4 5 5 1        5.91     .83     1 3 7 5          5.78     .72
  11       1 5 6 3 1        4.61     .78     4 9 3            4.68     .47
  12       4 6 5 1          6.02     .73     1 5 3 7          5.85     .88
  13       1 8 6 1          4.15     .62     1 1 11 2 1       3.82     .66
  14       7 9              6.35     .49     2 4 5 4 2        5.85     .95
  15       4 9 3            6.69     .47     1 3 11 1         6.52     .50
  16       2 12 2           6.75     .25     2 6 2 6          5.57    1.00
  17       12 3 1           5.08     .47     5 6 4 1          4.81     .71
  18       5 9 1 1          4.61     .55     1 7 4 4          5.37     .81
  19       1 6 8 1          4.27     .62     6 8 1 1          4.54     .55
  20       1 2 3 8 2        5.30     .87     1 4 4 4 4        5.13    1.08
  21       4 7 4 2          5.00     .77     4 7 3 2          4.95     .76
  22       4 6 4 2          5.02     .81     1 5 5 5          5.71     .78
  23       1 1 10 4         6.79     .59     1 1 5 8 1        6.25     .76
  24       6 6 4            5.71     .66     3 3 8 2          5.37     .79
  25       3 10 2           4.67     .37     5 4 7            5.96     .77
  26       3 5 5 2          4.11     .77     5 9 2            4.54     .51
  27       1 11 4           6.02     .41     1 2 8 3 2        5.96     .79
  28       3 5 7            3.97     .68     6 8 2            4.48     .56

Averages                    5.115    .634                     5.184    .718

The averages of the 28 A. D.'s in the case of the Baltimore 
County teachers and the graduate students, respectively, are 
.634 and .718. Here again it will be observed that the graduate 
students vary more in their judgments than do the teachers in 
the field, although both groups vary less with this fifth grade 
set than with the seventh grade set previously examined. This 
improvement cannot be accounted for by practice either, because 
it was a different set of judges from those who marked the other 
set. The difference is probably mere chance, or due possibly 
in part to a difference in the nature of the two sets of papers 
which makes one set a little more readily comparable with the 
scale than the other set. 

The average rating upon the set by the Baltimore County 
teachers is 5.115 and by the graduate students, 5.184, a difference 
of less than .07 of a step. The remarkable thing, however, is 
that the difference between the fifth grade set and the seventh 
grade set is less than .6 of a step according to either set of judges. 
It will be observed that the average judgments of the two groups 
of judges on the seventh grade papers differ from each other 
nearly one third as much as either one differs from the average 
of the fifth grade papers. Furthermore, the average variation 
among even the least variable group of judges is seen to be 
more than the difference between the average values assigned 
to the two sets of papers. In other words, the variability in 
steps of the Hillegas scale is nearly twice as great as the difference 
between the average rating on the fifth grade set and the average 
rating on the seventh grade set. Or, half the judges, roughly 
speaking, varied in their judgment on any paper from the aver- 
age judgment of the group, by more than the difference between 
the averages of these two sets of papers. This comparison 
serves, of course, to point out the slight improvement in com- 
position work between this particular fifth grade and seventh 
grade quite as well as to indicate the extent of variability among 
the judgments. Incidentally, it may be remarked, that this very 
service would be impossible even to this rough inexactness without 
such a scale for measuring the improvement from grade to grade. 

Another phase of variability which it is hoped the scale may 
help to decrease is the variability among the averages of two or 
more groups of judges upon the same paper. If this decrease is 
accomplished by the scale, then a supervisor may secure a reliable 
measure of progress by having several judges rate each paper, 
even if it were not possible to trust a single judgment. 

To determine the extent of this agreement the following table, 
No. 54, was constructed showing the difference between the 
averages of two groups of judges upon the same paper. For the 
seventh grade set the average of these differences is seen to be 
.632 steps of the scale, with five differences greater than one 
step. Twenty-five per cent of the differences are less than .34 
and 25 per cent are greater than .84. It will be noted further that 
the average deviation from the average of the column of average 
judgments rendered by the graduate students is .55 and for the 
Baltimore County teachers the same deviation is seen to be .66. 
Taken together these two figures average less than .632 which is 
the average of the differences. This signifies that the difference 
between the averages of two groups of judges upon each paper 
in this set of papers was greater than the average variation among 
the judgments given to the different papers in the set. 



TABLE 54 

Differences in Fractions of a Step Between the Average Rating Upon 
a Composition by One Set of About Sixteen Judges, and the Average 
Rating Upon the Same Composition by Another Group of About 
Sixteen Judges 

The set of seventh grade compositions on the left (from Tables 50 and 52) 
and the set of fifth grade compositions on the right (from Table 53). 
Grad. = graduate students; Teachers = Baltimore County teachers. 

                      Seventh Grade Set                    Fifth Grade Set
Paper           Grad.    Teachers        Diff.      Grad.    Teachers        Diff.
1                5.96        6.01          .05       4.94        5.64          .70
2                6.46        5.02         1.44       5.64        5.36          .28
3                5.71        5.63          .08       4.19        4.54          .35
4                6.13        6.47          .34       3.76        4.06          .30
5                6.93        6.41          .52       4.53        4.48          .05
6                6.69        6.75          .06       5.41        5.20          .21
7                6.13        5.85          .28       3.69        4.35          .66
8                4.86        5.29          .43       5.30        5.51          .21
9                5.23        5.11          .12       4.74        4.88          .14
10               6.25        5.08         1.17       5.91        5.78          .13
11               5.36        6.10          .74       4.61        4.68          .07
12               5.71        4.54         1.17       6.02        5.85          .17
13               6.30        5.49          .81       4.15        3.82          .33
14               6.93        7.40          .47       6.35        5.85          .50
15               5.71        6.42          .71       6.69        6.52          .17
16               6.19        6.60          .41       6.75        5.57         1.18
17               5.57        6.07          .50       5.08        4.81          .27
18               6.25        5.57          .68       4.61        5.37          .76
19               7.05        6.48          .57       4.27        4.54          .27
20               4.68        5.29          .61       5.30        5.13          .17
21               4.35        4.18          .17       5.00        4.95          .05
22               5.96        4.87         1.09       5.02        5.71          .69
23               5.29        4.61          .68       6.79        6.25          .54
24               6.08        5.37          .71       5.71        5.35          .36
25               6.27        5.40          .87       4.67        5.96         1.29
26               5.19        4.35          .84       4.11        4.54          .43
27               4.53        6.22         1.69       6.02        5.96          .06
28               6.47        6.96          .49       3.97        4.48          .51

Averages         5.87        5.69         .632      5.115       5.184         .384
A.D. from Avg.    .55         .66                     .76         .53
Middle 50%                          .34 to .84                          .17 to .50

In the case of the fifth grade set the differences are not so great, 
being on the average .384 steps, with 25 per cent of the cases 
below .17 steps and 25 per cent above .50. If we consider the 
average of the differences found in both groups of papers we find 
it just slightly above .50 steps. This means that if a child's 
composition be rated by one set of sixteen judges, and then by 
another set of sixteen judges (assuming that graduate students 
and Baltimore County teachers are typical judges, and that these 
two sets of papers are typical papers) the chances are one to one 
that the mark will be raised or lowered by one-half step or more. 
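
The summary figures quoted from Table 54 can be checked directly from its difference columns. The following is a minimal Python sketch, not part of the original study; the variable names are introduced here for illustration, and the two lists are the printed differences for papers 1 to 28 of each set. Recomputed in this way, the seventh grade average agrees with the printed .632, while the fifth grade average comes out a few thousandths above the printed .384.

```python
from statistics import mean

# Difference columns of Table 54, papers 1-28, in steps of the Hillegas scale.
seventh = [.05, 1.44, .08, .34, .52, .06, .28, .43, .12, 1.17, .74, 1.17, .81, .47,
           .71, .41, .50, .68, .57, .61, .17, 1.09, .68, .71, .87, .84, 1.69, .49]
fifth = [.70, .28, .35, .30, .05, .21, .66, .21, .14, .13, .07, .17, .33, .50,
         .17, 1.18, .27, .76, .27, .17, .05, .69, .54, .36, 1.29, .43, .06, .51]

mean_seventh = mean(seventh)                   # printed in the table as .632
mean_fifth = mean(fifth)                       # printed in the table as .384
combined = (mean_seventh + mean_fifth) / 2     # "just slightly above .50"
over_one_step = sum(d > 1 for d in seventh)    # differences greater than one step

print(f"seventh grade: {mean_seventh:.3f}, fifth grade: {mean_fifth:.3f}")
print(f"average of the two sets: {combined:.3f}")
print(f"seventh grade differences above one step: {over_one_step}")
```
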

As a further measure of this agreement between the two sets 
of judges on the same papers, the coefficients of correlation by 
the method of unlike signed pairs using the average as the central 
tendency, were determined between the series of averages, and 
the following results were obtained: 

Seventh grade set: 

   Baltimore County teachers and graduate students, 
   both using the Hillegas scale ......................... .33 

   Baltimore County teachers using the Hillegas scale 
   and same teachers using the percentage scale .......... .78 

Fifth grade set: 

   Baltimore County teachers and graduate students, 
   both using the Hillegas scale ......................... .85 

It appears from these figures that there is no consistent uni- 
formity in the averages of different groups of judges on the same 
paper, the coefficient in the case of one set being quite high, but 
in the case of the other set being quite low. 
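
The text does not state the formula behind the method of unlike signed pairs. A minimal Python sketch follows, assuming the form usually associated with that method, r = cos(pi * U), where U is the proportion of papers on which the two group averages deviate in opposite directions from their own means. The function name `unlike_signs_r` and the variable names are introduced here for illustration, and the column order (graduate students, then Baltimore County teachers) is an assumption, though the result does not depend on it. Applied to the averages of Table 54, this assumed formula gives values that round to the two Hillegas-scale coefficients above.

```python
import math
from statistics import mean

def unlike_signs_r(xs, ys):
    """Correlation by unlike-signed pairs: r = cos(pi * U), where U is the share
    of pairs whose deviations from the two series means have opposite signs."""
    mx, my = mean(xs), mean(ys)
    unlike = sum((x - mx) * (y - my) < 0 for x, y in zip(xs, ys))
    return math.cos(math.pi * unlike / len(xs))

# Average ratings from Table 54, papers 1-28 (assumed order: students, teachers).
seventh_students = [5.96, 6.46, 5.71, 6.13, 6.93, 6.69, 6.13, 4.86, 5.23, 6.25,
                    5.36, 5.71, 6.30, 6.93, 5.71, 6.19, 5.57, 6.25, 7.05, 4.68,
                    4.35, 5.96, 5.29, 6.08, 6.27, 5.19, 4.53, 6.47]
seventh_teachers = [6.01, 5.02, 5.63, 6.47, 6.41, 6.75, 5.85, 5.29, 5.11, 5.08,
                    6.10, 4.54, 5.49, 7.40, 6.42, 6.60, 6.07, 5.57, 6.48, 5.29,
                    4.18, 4.87, 4.61, 5.37, 5.40, 4.35, 6.22, 6.96]
fifth_students = [4.94, 5.64, 4.19, 3.76, 4.53, 5.41, 3.69, 5.30, 4.74, 5.91,
                  4.61, 6.02, 4.15, 6.35, 6.69, 6.75, 5.08, 4.61, 4.27, 5.30,
                  5.00, 5.02, 6.79, 5.71, 4.67, 4.11, 6.02, 3.97]
fifth_teachers = [5.64, 5.36, 4.54, 4.06, 4.48, 5.20, 4.35, 5.51, 4.88, 5.78,
                  4.68, 5.85, 3.82, 5.85, 6.52, 5.57, 4.81, 5.37, 4.54, 5.13,
                  4.95, 5.71, 6.25, 5.35, 5.96, 4.54, 5.96, 4.48]

r_seventh = unlike_signs_r(seventh_students, seventh_teachers)
r_fifth = unlike_signs_r(fifth_students, fifth_teachers)
print(f"seventh grade set: {r_seventh:.2f}")
print(f"fifth grade set:   {r_fifth:.2f}")
```
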

One more sort of check seems important. Will the same per- 
son rating a paper the second time after the lapse of several 
days tend to rate more consistently than two separate judges? 
To secure the answer to this question a different method of treat- 
ment was necessary from that used previously in the composi- 
tion study, although similar to that used in the handwriting 
study. No judge can render many valuable ratings on the same 
paper because the element of familiarity with the composition 
soon prejudices his judgment. Consequently, in order to make 
the most use of two judgments by each judge, it was decided to 
calculate the difference between each judgment and each other 
judgment given the same paper by several judges and then com- 
pare with the average of these differences the average difference 
between a judge's first judgment and his second. For this purpose 
the data secured during the summer session of 1913 at Teachers 
College by Mr. R. O. Runnells were given to me. Twenty- 
three papers were rated by four judges, and then rerated by the 
same judges after several days. The first and second judgments 
on these papers are recorded in Table 55, and the differences in 
judgments in terms of steps of the Hillegas scale are given in 
Table 56, page 126. 

TABLE 55 

Judgments of Four Graduate Students of Teachers College Upon 
Each of Twenty-Three English Compositions by Means of the 
Hillegas Scale 

A second series of judgments taken several days later by the same judges 
are recorded for purposes of comparison 

            Judge I        Judge II       Judge III       Judge IV
Papers    1st    2nd     1st    2nd     1st    2nd     1st    2nd
   1      7.72   7.72    9.37   7.72    7.72   6.75    6.75   7.72
   2      7.72   7.72    8.38   9.37    8.38   8.38    9.37   9.37
   3      7.72   7.72    4.74   8.38    7.72   6.75    5.85   7.72
   4      7.72   7.72    6.75   8.38    7.72   7.72    7.72   6.75
   5      7.72   7.72    6.75   8.38    6.75   6.75    7.72   6.75
   6      7.72   7.72    7.72   9.37    8.38   8.38    7.72   8.38
   7      7.72   6.75    8.38   8.38    7.72   7.72    6.75   none
   8      7.72   6.75    9.37   8.38    8.38   8.38    7.72   7.72
   9      6.75   6.75    7.72   7.72    6.75   7.72    7.72   7.72
  10      7.72   6.75    7.72   6.75    5.85   5.85    5.85   5.85
  11      6.75   6.75    8.38   6.75    6.75   6.75    5.85   6.75
  12      6.75   6.75    5.85   6.75    5.85   5.85    6.75   6.75
  13      7.72   6.75    8.38   9.37    7.72   7.72    5.85   6.75
  14      7.72   6.75    7.72   6.75    6.75   6.75    5.85   5.85
  15      6.75   6.75    6.75   8.38    7.72   6.75    5.85   5.85
  16      7.72   7.72    9.37   9.37    8.38   8.38    8.38   8.38
  17      6.75   6.75    5.85   6.75    6.75   5.85    6.75   7.72
  18      6.75   6.75    7.72   6.75    6.75   7.72    7.72   7.72
  19      6.75   6.75    8.38   6.75    5.85   5.85    5.85   5.85
  20      6.75   6.75    4.74   6.75    6.75   5.85    6.75   5.85
  21      6.75   6.75    6.75   7.72    7.72   7.72    6.75   6.75
  22      6.75   6.75    6.75   6.75    5.85   5.85    7.72   7.72
  23      6.75   6.75    8.38   6.75    6.75   6.75    7.72   6.75


No clearer evidence of the varying efficiency among judges in 
the use of the scale could be asked than these tables show. Note, 
for example, that judge I uses no other points on the scale in 
either his first or his second judgment than 6.75 and 7.72. No 
wonder that his first and second judgments show slight differences. 
On the other hand, judge II, while ranging all the way 
from 4.74 to 9.37, uses such slight discrimination that there is 
scarcely any relation between his first and second judgment. 
(The coefficient of correlation by the unlike signed pairs method, 
using the median as the central tendency, is only .16.) In the 
case of judge III we see a combination of a reasonably wide range 
of marks with a fair consistency between his successive judgments. 

TABLE 56 

Differences Among the Marks of the Four Judges Whose Ratings 
Are Recorded in Table 55, in Terms of Steps on the Hillegas Scale 

(Table body not preserved; surviving entries: 2.22, .59.) 
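
The contrast among the judges can be illustrated from the ratings of Table 55. The sketch below is a minimal Python illustration and not a reconstruction of Table 56: it assumes that a difference is simply the absolute difference between the two scale values a judge gave the same paper, and it skips the one paper for which judge IV gave no second rating. The names `TABLE_55` and `retest_difference` are introduced here for illustration.

```python
# Table 55: per paper, (1st, 2nd) ratings for judges I, II, III, IV in order.
# None marks the single missing rating (judge IV, paper 7).
TABLE_55 = [
    (7.72, 7.72, 9.37, 7.72, 7.72, 6.75, 6.75, 7.72),
    (7.72, 7.72, 8.38, 9.37, 8.38, 8.38, 9.37, 9.37),
    (7.72, 7.72, 4.74, 8.38, 7.72, 6.75, 5.85, 7.72),
    (7.72, 7.72, 6.75, 8.38, 7.72, 7.72, 7.72, 6.75),
    (7.72, 7.72, 6.75, 8.38, 6.75, 6.75, 7.72, 6.75),
    (7.72, 7.72, 7.72, 9.37, 8.38, 8.38, 7.72, 8.38),
    (7.72, 6.75, 8.38, 8.38, 7.72, 7.72, 6.75, None),
    (7.72, 6.75, 9.37, 8.38, 8.38, 8.38, 7.72, 7.72),
    (6.75, 6.75, 7.72, 7.72, 6.75, 7.72, 7.72, 7.72),
    (7.72, 6.75, 7.72, 6.75, 5.85, 5.85, 5.85, 5.85),
    (6.75, 6.75, 8.38, 6.75, 6.75, 6.75, 5.85, 6.75),
    (6.75, 6.75, 5.85, 6.75, 5.85, 5.85, 6.75, 6.75),
    (7.72, 6.75, 8.38, 9.37, 7.72, 7.72, 5.85, 6.75),
    (7.72, 6.75, 7.72, 6.75, 6.75, 6.75, 5.85, 5.85),
    (6.75, 6.75, 6.75, 8.38, 7.72, 6.75, 5.85, 5.85),
    (7.72, 7.72, 9.37, 9.37, 8.38, 8.38, 8.38, 8.38),
    (6.75, 6.75, 5.85, 6.75, 6.75, 5.85, 6.75, 7.72),
    (6.75, 6.75, 7.72, 6.75, 6.75, 7.72, 7.72, 7.72),
    (6.75, 6.75, 8.38, 6.75, 5.85, 5.85, 5.85, 5.85),
    (6.75, 6.75, 4.74, 6.75, 6.75, 5.85, 6.75, 5.85),
    (6.75, 6.75, 6.75, 7.72, 7.72, 7.72, 6.75, 6.75),
    (6.75, 6.75, 6.75, 6.75, 5.85, 5.85, 7.72, 7.72),
    (6.75, 6.75, 8.38, 6.75, 6.75, 6.75, 7.72, 6.75),
]

def retest_difference(judge):
    """Mean absolute difference between first and second rating; judge is 0..3."""
    pairs = [(row[2 * judge], row[2 * judge + 1]) for row in TABLE_55]
    diffs = [abs(a - b) for a, b in pairs if a is not None and b is not None]
    return sum(diffs) / len(diffs)

retest = [retest_difference(j) for j in range(4)]
for name, value in zip(["I", "II", "III", "IV"], retest):
    print(f"judge {name}: {value:.2f}")
```

On this reading judge I changes least between his two judgments and judge II most, with judge III between them, which is the order the paragraph above describes.
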

The type of composition seems to have no bearing upon the 
amount of difference among the judgments. That is, it does not 
follow that where the judges differ widely one from the other, 
the judges will likewise differ widely from their previous judg- 
ments. The correlation is negative, in fact, between the column 
of "averages of the 12 differences," and the column "average" 
of differences between each judge's mark and his own previous 
mark. 

All of the compositions whose ratings have been examined thus 
far have been those written by elementary school children. 
The following data are submitted as evidence that the relative 
variability of marking by the regular percentage method and 
by the Hillegas scale is not much different when high school 
papers are used, and when the rating is done by high school 
teachers from all the departments of the high school. A group 
of 24 papers written by the members of a class in English in the 
Columbus, Ohio, High School,¹ were rated by ten teachers in the 
same high school on the regular basis of 100. The papers were 
then given to the same teachers to be rated by the Hillegas scale, 
with the instructions that they give to each paper the value on 
the scale assigned to the composition which they considered most 
nearly equal to it in merit. Both groups of ratings are given in 
Table 57, page 128. 

It will be observed from Table 57 that the average deviation 
among this group of teachers on these high school papers is 
greater than the average deviation found for any of the elemen- 
tary school papers by both the percentage method and the 
scale method of rating. With these high school papers the 
average of the A.D.'s for the percentage method of rating is 
6.46 units of the scale, and the average of the A. D.'s for the 
scale method of rating is .875 steps of the scale. If we equate 
the steps of the Hillegas scale with units of the percentage scale 
by simply calling equal the range between the average scores of 
the three lowest papers and the average scores of the three 
highest papers by the two methods of rating, we get one step of 
the Hillegas scale equal to 9.49 units of the percentage scale. 
If we now reduce the average deviation found for the Hillegas 
scale ratings to its equivalent value in units of the percentage 
scale by multiplying .875 by 9.49, we get 8.30 as the value in 
percentage scale units of the average deviation by the Hillegas 
scale method. This, it will be observed, is considerably larger 
than the average deviation by the percentage method, that 
figure being 6.46. 
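
The conversion in the preceding paragraph is a single multiplication, sketched here in Python as a check on the arithmetic. The constant and variable names are introduced for illustration; the figures themselves are those quoted in the text.

```python
# One step of the Hillegas scale, expressed in units of the percentage scale,
# as obtained in the text by equating the ranges of the two sets of ratings.
STEP_IN_PERCENT_UNITS = 9.49

ad_scale_steps = 0.875   # average of the A.D.'s by the scale method, in steps
ad_percentage = 6.46     # average of the A.D.'s by the percentage method

ad_scale_in_percent_units = ad_scale_steps * STEP_IN_PERCENT_UNITS
print(f"scale A.D. in percentage units: {ad_scale_in_percent_units:.2f}")
print(f"percentage method A.D.:         {ad_percentage:.2f}")
```
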

It seems unnecessary to enter into any extended study of this 
table, since it corresponds in essential respects with what has 
already been pointed out in the previous tables. One feature, 
however, is deserving of note. The average of each teacher's 
ratings on all the papers is given at the bottom of the table. 
From these averages we may see that while there is still a large 
range of variation by the scale method, there is a larger range 
by the percentage method. This indicates that the scale has 
tended to equalize the standards among the teachers, even 
though they have varied in their ratings by it more than they 
varied without it. This may best be seen from a comparison 
of the deviations from the average of these two lists of average 
ratings. By the percentage method the average of the ten 
averages at the bottom of the table is 85.0 and the average 
deviation from the average is 4.46 units of the scale. In the case 
of the Hillegas scale ratings, the average of the averages at the 
bottom of the table is 7.21, and the average deviation is .327. 
(This is not in terms of steps of the scale exactly, but is nearly 
enough for practical purposes, since the step of the scale in 
which this falls extends from 6.75 to 7.72, nearly 1 P. E.) If 
now we multiply this latter A. D. by 9.49 we get 3.1 as the A. D. 
among teachers' averages by the Hillegas scale as compared with 
an A. D. of 4.46 by the percentage method. 

¹ The data for this study were procured by Principal A. W. Castle of 
Columbus, Ohio, to whom my thanks are hereby expressed. 

In the light of the above findings it is pertinent to examine 
the derivation of the Hillegas scale to find out whether from its 
very nature we must expect as wide variability as we have found. 
In the two criticisms of the scale made by F. W. Johnson in the 
School Review of January, 1913, we have a hint that it must 
always be found impossible to compare one composition as a 
whole with each one of a variety of others. He found among 
the judgments of high school teachers of composition as well 
as members of a graduate class in educational tests, a wider 
variability in rating compositions of high school students than 
is revealed in this study in the rating of fifth or seventh grade 
papers. He found furthermore a very considerable difference in 
the average ratings of these two groups of judges upon the same 
composition. Is there a fundamental reason for this? 

In deriving the scale Dr. Hillegas used as the unit of difference 
in merit that amount which was recognized by 75 per cent of 
competent judges. The range from 0 merit to the highest, 9.37, 
represents 9.37 of those units of difference. For the sake of ease 
in discussion I shall speak of the scale as if it contained 10 steps 
of equal length extending from 0 merit to 10, and the samples 
of composition standing at the successive steps of the scale I 
shall designate by their values as 0, 1, 2, etc. Suppose that a 
composition of value 6 is being rated by 100 judges. We must 
expect twenty-five of the judges to rate the paper at 5 or less, 
and twenty-five to rate the paper at 7 or more. If the judgments 
distribute themselves according to the normal surface of frequency, 
then 2.5 judges will call the paper 9 or better, 6.5 judges 
will call it 8, 16 judges will call it 7, 16 will call it 5, 6.5 will call it 
4, and 2.5 will call it 3 or worse. With this distribution which the 
derivation presupposes, we have an average deviation from the 
average of .73 steps of the scale. This, it will be recalled, is 
larger than most of the A. D.'s actually found with the fifth and 
seventh grade papers, although smaller than those found with 
high school papers. It will be found, no doubt, that in the 
case of a paper possessing marked individuality the ratings 
will vary even more widely. For example, three seventh grade 
compositions which were written with the view of meeting the 
criticisms of the children's classmates when read orally, were 
rated by fifty graduate students in Teachers College. The dis- 
tributions of judgments on the three papers are shown in 
Table 58, where the average deviations are seen to be practi- 
cally 1.00 on each paper. 

TABLE 58 

Tables of Frequency of Judgments of Fifty Graduate Students of 
Teachers College Upon Each of Three Compositions Which 
Possessed Strong Individuality. Hillegas Scale Was Used 

Papers   260   369   474   585   675   772   838   937    Avg.   A. D.
I                1     5     7    14    19     3     1    6.90     .97
II                     1     8    11    15    11     4    7.51    1.02
III                    3     9    10    17     8     3    7.27    1.07
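
The expected average deviation of .73 for a composition of value 6 rated by 100 judges can be checked from the frequencies given in the derivation. The following Python sketch is illustrative only. It assumes that "9 or better" may be counted as 9 and "3 or worse" as 3, and that the fifty judges not otherwise accounted for rate the paper at 6; the name `expected` is introduced here.

```python
# Rating -> number of judges out of 100, for a paper whose scale value is 6.
# The 50 judges at 6 are inferred: 25 rate at 5 or less and 25 at 7 or more.
expected = {3: 2.5, 4: 6.5, 5: 16, 6: 50, 7: 16, 8: 6.5, 9: 2.5}

total = sum(expected.values())
average = sum(rating * n for rating, n in expected.items()) / total
avg_deviation = sum(abs(rating - average) * n for rating, n in expected.items()) / total

print(f"judges: {total:g}, average rating: {average:g}")
print(f"average deviation from the average: {avg_deviation:.2f}")
```
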



It is thus seen that the distributions obtained in this study show 
rather less than normal variability for unpracticed judges (nor- 
mal being determined by the variability of the judges whose 
ratings entered into the makeup of the scale). The very effort 
to define general merit in so complex a thing as a composition by 
a single example seems to make great variation unavoidable. 
It does not seem strange, in fact, that a general concept of merit 
which is standard for seventh grade, for example, should become 
with practice more uniform in the minds of teachers than this 
definition of merit by a single illustrative composition can be 
expected to be. An adequate definition of anything so complex 
as merit in composition work must include separate definitions 
of merit for the several elements which go to make up the com- 
position. In other words, before we may hope to define merit 
so distinctly that teachers will vary little in rating compositions 
by the definition, we must have a series of scales each one devised 
for the measurement of a certain feature or phase of merit. 

The objection will undoubtedly be raised to the above sugges- 
tion that when separate scales are prepared for the standardiza- 
tion of the several elements which enter into the merit of a com- 
position, the scheme will have become so complex that no teacher 
will have the courage to use the standards. In answer to this 
objection may it not be well to consider the legitimate use of 
standard scales? The Hillegas scale was derived on the basis 
of general merit of compositions. Before such a scale can be 
applied successfully by teachers they must have a clear concep- 
tion of the relative values among the elements which constitute 
merit. It is certain that a single scale of merit cannot give them 
this conception. It is for the purpose of defining this merit in 
terms of its various elements that several scales, each based upon 
a different element, are necessary. With these scales before 
her she can check up her own concept of merit. Rating composi- 
tions is of necessity a subjective process, and the value of a 
series of scales in composition must be for purposes of definition 
of the marks to be used, in the mind of the one doing the rating. 
To measure a set of papers by placing them beside the scale 
should be a rare exercise of any teacher. 

Thus it seems that the simplicity of the Hillegas scale, which 
commends itself to us so highly, tends to make the operation of 
rating papers by it very easy, but at the same time ineffective. 
If we assume that the chief value of scales is the standardization 
of the concept of merit held by the teaching body, we shall not 
be afraid of a sufficient degree of complexity to make the scales 
effective. 

In all of this discussion of the Hillegas scale we have not taken 
account of the effect of practice with the scale. The judges whose 
ratings enter into the derivation of the scale, as well as all the 
teachers whose ratings are recorded in these tables, have had no 
practice with the scale. We have no evidence as to how much a 
group of teachers would decrease their variability by persistent 
use of the scale. Such evidence is sorely needed. We have a 
little evidence in Tables 52 and 56 pointing strongly in the direc- 
tion of great gain by practice. In Table 52 we have the sixteen 
series of judgments by the twenty-eight judges. If we compare 
the deviations of the first eight with the deviations of the last 
eight, we find (using simply the averages at the bottom of the 
table) the average variation of the first eight from the average, 
to be .21 as compared with .12 for the last eight. Similarly, we 
have in Table 56 the average of the differences between judg- 
ments the first time the judges use the scale, and also the differ- 
ence between judgments of the same judges with the same papers 
the second time they use the scale. In the case of the first series 
of judgments the average difference between judgments by differ- 
ent judges is 1.03 steps of the scale, while with the second series 
of judgments it is .83 steps. It seems probable then that a 
considerable amount of the variability will be removed when the 
judges are practised in the use of the scale. 



CONCLUSIONS 

1. A given grade or mark means many widely different things 
to different teachers when they are rating pupils for promotion. 
As measured by the achievement of the several school groups 
in their later work this difference amounts in some cases to as 
much as the difference between a G (good) and F- (fair minus) 
in elementary schools where the basis of marking includes only 
the steps P, F, G, and E. In high schools there is enough differ- 
ence between the standards of schools as wholes that, measured 
by the achievement of the school groups in later school work, a 
mark of 70 in one school means more than a mark of 81 in another 
school having the same passing standard by points. Within 
the high school and within the college the percentage of pupils 
which the various instructors fail as a common practice extending 
over several years varies from 0 to 28, or more. 

2. In rating examination papers very great differences of stand- 
ards appear among supposedly equally competent judges. Refer- 
ences to the tables given in the text must be made to determine 
the extent of this variation for the several subjects and among 
the several groups of teachers. In the Regents' examinations 
for New York where only 25 per cent of the papers fall at 75 or 
above on the scale, and where the passing mark is 60, the state 
examiners change one fourth of the teachers' marks by 10 points 
or more, another fourth by from 5 to 10 points, and the remain- 
ing half by less than 5. On the whole, the state examiners fail 
nearly as large a percentage of the papers which the teachers pass 
as the teachers fail of all the papers written. 

3. The effort on the part of Courtis to standardize the ability 
to do single combinations in arithmetic in the upper grades is a 
bad educational policy. Probably no uniform test in arithmetic 
should be given to all ages of pupils. 

4. Rating of papers by means of statistically derived scales 
when the judges are unpractised in the use of the scales but 
experienced in marking by the common methods, produces different 
results for different subjects. In drawing, the variability is 
greatly reduced by the use of the scale. In handwriting, the 
variability is about equal with and without the scale. In com- 
position, the variability is somewhat greater with the scale than 
without it. 